#### Author
Yury Kashnitsky

#### Reference
[Notion ticket](https://www.notion.so/a74951e4e815480584dea7d61ddce6cc?v=dbfdb1207d0e451b827d3c5041ed0cfd&p=5f341a7320184677a11708a109dfff20)

#### Idea
Fix the cross-validation folds for the ongoing and follow-up experiments. Run the baseline Tf-Idf & logistic regression model with these folds.

#### Data
4500 cryptonews titles labeled as positive, neutral, or negative, stored as a zipped, password-protected [CSV](https://drive.google.com/file/d/1Apr3YPZVf0kOJ5Pc1RYDoQxTdjJPbnt4/view?usp=sharing) (not to be shared outside of the project!)

#### Result
Use `StratifiedKFold(n_splits=5, shuffle=True, random_state=17)` for all experiments so that results stay comparable. 
Additionally, the file `../data/folds.csv` stores the mapping from title ids to CV split ids (from 0 to 4). 
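
Follow-up experiments can rebuild exactly the same splits from the saved file instead of re-creating the splitter. The sketch below uses `PredefinedSplit` from scikit-learn. It assumes that the file has an `id,split` header and that its row order matches `train_df`. The in-memory CSV is a made-up stand-in for `../data/folds.csv`, not the real content.

```python
import io

import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical stand-in for '../data/folds.csv' (same layout, made-up rows).
folds_csv = io.StringIO(
    "id,split\n0,2\n1,4\n2,0\n3,1\n4,3\n5,2\n6,0\n7,1\n8,4\n9,3\n"
)
folds = pd.read_csv(folds_csv, index_col='id')

# Each row is placed in the validation part of the fold named in 'split'.
cv = PredefinedSplit(test_fold=folds['split'].to_numpy())

for fold_id, (train_ids, val_ids) in enumerate(cv.split()):
    print(fold_id, val_ids)
```

The resulting `cv` object can be passed as the `cv` argument of `cross_val_score` in place of `skf`.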

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score, StratifiedKFold
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import learning_curve

from matplotlib import pyplot as plt
import seaborn as sns; sns.set()               # better style
%config InlineBackend.figure_format = 'retina' # sharper plots

## Reading the data

In [2]:
PATH_TO_DATA = '../data/20190110_train_4500.csv'

In [3]:
train_df = pd.read_csv(PATH_TO_DATA)

In [4]:
train_df.head()

Unnamed: 0,title,sentiment
0,Bitcoin Market Has Run Out of Juice: Cryptocur...,Negative
1,Bitcoin Core 0.14.0 Speeds Up Blockchain Synci...,Positive
2,Thinking of Travelling With Bitcoin? With Thes...,Positive
3,Investors Carried Out Mental Gymnastics to Jus...,Negative
4,"Bitcoin Price Holds Above $8,500 as Market Fig...",Positive


In [5]:
train_df['sentiment'].value_counts(normalize=True)

Positive    0.463329
Negative    0.330259
Neutral     0.206412
Name: sentiment, dtype: float64

## Model

Parameters were tuned previously. 

In [6]:
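# Note: scikit-learn applies `stop_words` only when analyzer='word',
# so it has no effect with the character analyzer used here.
title_transformer = TfidfVectorizer(ngram_range=(1, 5), 
                                    min_df=8, 
                                    analyzer='char', 
                                    max_features=100000,
                                    stop_words='english')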
logit = LogisticRegression(C=2.7, 
                           random_state=17, 
                           multi_class='multinomial', 
                           solver='lbfgs', 
                           n_jobs=4, 
                           max_iter=500)

model = Pipeline([('tfidf', title_transformer), ('logit', logit)])
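
To make the `analyzer='char'`, `ngram_range=(1, 5)` setting more concrete, here is a minimal pure-Python sketch of character n-gram extraction. The helper `char_ngrams` is hypothetical and only illustrates the idea. It is not scikit-learn's implementation, which also normalizes whitespace and then applies Tf-Idf weighting and the `min_df` / `max_features` filters.

```python
def char_ngrams(text, min_n=1, max_n=5):
    """Return all character n-grams of `text` for n in [min_n, max_n]."""
    text = text.lower()  # TfidfVectorizer lowercases by default
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the string, spaces included.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


print(char_ngrams("BTC up", min_n=1, max_n=2))
```

Because the windows cross word boundaries, fragments such as `"c "` or `" u"` also become features, which is one reason character n-grams can be robust to misspellings and unusual tokens in short titles.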

## Running cross-validation

Prepare folds

In [7]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

In [8]:
train_df['split'] = 0

In [9]:
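# skf.split yields positional indices; using them with .loc is safe here
# because train_df has the default RangeIndex (labels equal positions).
for fold_id, (train_ids, val_ids) in enumerate(skf.split(X=train_df['title'], y=train_df['sentiment'])):
    train_df.loc[val_ids, 'split'] = fold_id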

In [10]:
train_df.head(2)

Unnamed: 0,title,sentiment,split
0,Bitcoin Market Has Run Out of Juice: Cryptocur...,Negative,2
1,Bitcoin Core 0.14.0 Speeds Up Blockchain Synci...,Positive,4
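
A quick sanity check on the folds is to confirm that each split keeps roughly the overall class proportions, which is what stratification is meant to provide. The sketch below runs on made-up toy labels rather than the project data. On the real data frame, `pd.crosstab(train_df['split'], train_df['sentiment'], normalize='index')` would give the same kind of table.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy labels standing in for train_df['sentiment'] (hypothetical data).
labels = pd.Series(
    ['Positive'] * 50 + ['Negative'] * 30 + ['Neutral'] * 20,
    name='sentiment',
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
split = pd.Series(0, index=labels.index, name='split')
for fold_id, (_, val_ids) in enumerate(skf.split(X=labels, y=labels)):
    split.iloc[val_ids] = fold_id

# Rows: fold id, columns: class, values: class share within the fold.
fold_shares = pd.crosstab(split, labels, normalize='index')
print(fold_shares)
```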


In [11]:
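# header=True is passed explicitly: older pandas versions omit the header for Series by default.
train_df['split'].to_csv('../data/folds.csv', index_label='id', header=True)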

Run cross-validation

In [12]:
%%time
cv_results = cross_val_score(estimator=model, 
                             X=train_df['title'], 
                             y=train_df['sentiment'], 
                             cv=skf,
                             n_jobs=5)

CPU times: user 38.1 ms, sys: 43.7 ms, total: 81.8 ms
Wall time: 6.22 s


In [13]:
cv_results, cv_results.mean()

(array([0.72228321, 0.7025247 , 0.71020856, 0.72996707, 0.74175824]),
 0.7213483552671259)
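
Since no `scoring` argument is passed, `cross_val_score` falls back to the estimator's own `score` method, which is accuracy for `LogisticRegression`. The sketch below summarizes the five fold scores reported above with their spread and compares the mean with a majority-class baseline, taken here to be the share of `Positive` titles from the `value_counts` output. The variable names are illustrative only.

```python
from statistics import mean, stdev

# Fold accuracies and the 'Positive' class share, copied from the outputs above.
fold_scores = [0.72228321, 0.7025247, 0.71020856, 0.72996707, 0.74175824]
majority_share = 0.463329

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across the 5 folds

print(f"accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
print(f"gain over always predicting 'Positive': {cv_mean - majority_share:.4f}")
```

Reporting the spread alongside the mean helps judge whether a later model's improvement on the same folds is larger than fold-to-fold variation.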