# Predict Newspapers from Newspaper Articles

You can find the Jupyter Notebook for this lesson here - run it on Binder here.

## The Dataset

The [climate-news-db](https://www.climate-news-db.com/) is a dataset of newspaper articles on climate change, created & maintained by [Data Science South](https://www.datasciencesouth.com/).

## Project Goals

We are given a newspaper article without knowing which newspaper it comes from.

We must predict which newspaper an article comes from - this is a multi-class classification problem, with one class per newspaper.

## Project Plan

In this project, we will iterate through three pipelines.  Each iteration will build on top of the previous implementation.

### Iteration One

- split our data into development and holdout datasets,
- simple data cleaning and feature engineering,
- train & test a dummy model & random forest.

### Iteration Two

- refactor the first iteration,
- add cross validation.

### Iteration Three

- refactor the second iteration,
- add hyperparameter tuning.

## Exploratory Data Analysis

Our dataset here is a snapshot of the [climate-news-db](https://www.climate-news-db.com/) database, which is a SQLite database called `db.sqlite`:

In [1]:
import sqlite3
import pandas as pd

# setup custom options for display of column width - useful for the article body text
pd.set_option("display.max_columns", 24)
pd.set_option("display.max_colwidth", 40)
pd.set_option("display.width", None)

# establish a connection to the SQLite database
conn = sqlite3.connect("./data/db.sqlite")

# read the article table using pandas
articles = pd.read_sql_query("SELECT * from article", conn)

print("raw article data:")
print(articles.shape)
print(articles.columns)

raw article data:
(8534, 9)
Index(['id', 'body', 'headline', 'article_name', 'article_url',
       'date_published', 'date_uploaded', 'newspaper_id', 'article_length'],
      dtype='object')
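
Before reading a table it can help to check which tables the database contains. The sketch below is a minimal illustration, not part of the original notebook: it builds a throwaway in-memory database (the two table definitions are assumptions that only mimic the `article` and `newspaper` tables) and lists its tables through SQLite's built-in `sqlite_master` catalogue.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# An in-memory database stands in for ./data/db.sqlite (assumption for illustration).
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE newspaper (id INTEGER PRIMARY KEY, fancy_name TEXT)"
    )
    demo_conn.execute(
        "CREATE TABLE article (id INTEGER PRIMARY KEY, body TEXT, newspaper_id INTEGER)"
    )
    # sqlite_master is SQLite's catalogue of tables, indexes and views
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
        demo_conn,
    )

table_names = tables["name"].tolist()
print(table_names)
```

Wrapping the connection in `contextlib.closing` closes it once the block exits, which the notebook cell above leaves to the end of the session.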


We can use `pd.DataFrame.sample()` to show three random articles with their headline and body text:

In [2]:
samples = articles.sample(3)

for n in range(samples.shape[0]):
    sample = samples.iloc[n]
    print(sample['headline'])
    print(sample['body'][:240])
    print("")

2020 was tied for the hottest year ever recorded -- but the disasters fueled by climate change set it apart
Global average temperatures last year were tied for the hottest on record, capping what was also the planet's hottest decade ever recorded, according to new data analysis released Friday.The last six years are now the hottest six on record,

Is it climate change or global warming? How science and a secret memo shaped the answer
As director of the Yale Program on Climate Change Communication, Anthony Leiserowitz gets brought in to a lot of conversations about the topic. He shapes stories about it with other scientist for publication. He talks to CEOs and politician

Greenpeace warns of ‘dangerous temperatures’ for Tokyo, Beijing
Study shows hot weather is starting earlier and that more frequent heat waves are likely. Scorching temperatures are becoming much more frequent in cities across East Asia, an analysis from Greenpeace East Asia has found, with the environme



## Merging with Newspaper Metadata

Our articles dataset refers to each newspaper only by an integer ID:

In [3]:
print(articles.columns)
print(articles['newspaper_id'])

Index(['id', 'body', 'headline', 'article_name', 'article_url',
       'date_published', 'date_uploaded', 'newspaper_id', 'article_length'],
      dtype='object')
0        3
1        3
2        3
3        3
4        3
        ..
8529    16
8530    16
8531    16
8532    16
8533    16
Name: newspaper_id, Length: 8534, dtype: int64


This integer ID is the primary key of a table called `newspaper` - we can use it to join the two tables together, giving us a variable `data` with columns from both tables:

In [4]:
newspapers = pd.read_sql_query("SELECT * from newspaper", conn)
data = pd.merge(articles, newspapers, left_on="newspaper_id", right_on="id")
data = data.rename({"fancy_name": "newspaper"}, axis=1)
assert data.shape[0] == articles.shape[0]
print("merged article and newspaper data:")
print(data.columns)

merged article and newspaper data:
Index(['id_x', 'body', 'headline', 'article_name', 'article_url',
       'date_published', 'date_uploaded', 'newspaper_id', 'article_length',
       'id_y', 'name', 'newspaper', 'newspaper_url', 'color', 'article_count',
       'average_article_length'],
      dtype='object')
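
The `id_x` and `id_y` columns appear because both tables have an `id` column. As a hedged variation on the cell above (toy data, not the real tables), the sketch below names the suffixes explicitly and asks pandas to check that each article matches at most one newspaper. It also uses `how="left"`, which differs from the default inner join used above: unmatched articles would be kept rather than dropped.

```python
import pandas as pd

# toy stand-ins for the article and newspaper tables (made up for illustration)
toy_articles = pd.DataFrame(
    {"id": [10, 11, 12], "headline": ["a", "b", "c"], "newspaper_id": [1, 1, 2]}
)
toy_newspapers = pd.DataFrame({"id": [1, 2], "fancy_name": ["The BBC", "CNN"]})

merged = pd.merge(
    toy_articles,
    toy_newspapers,
    left_on="newspaper_id",
    right_on="id",
    how="left",  # keep every article, even one without a matching newspaper
    validate="many_to_one",  # raise if a newspaper id is duplicated on the right
    suffixes=("_article", "_newspaper"),
)
merged = merged.rename(columns={"fancy_name": "newspaper"})
print(merged.columns.tolist())
```

With `validate="many_to_one"`, a duplicated newspaper ID raises an error at merge time instead of silently multiplying rows, which complements the row-count `assert` used above.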


## Split Out a Holdout Set

Before we do any data science work (including data exploration), we will split off a holdout set.

This holdout data will sit untouched for as long as possible - when we need it, we can use it to evaluate our model on data it didn't see during model development.

In [5]:
import pathlib
from sklearn.model_selection import train_test_split

home = pathlib.Path("./data")
home.mkdir(exist_ok=True)
dev, ho = train_test_split(
  data,
  test_size=0.2,
  random_state=42,
)
dev.to_parquet(home / "dev.parquet")
ho.to_parquet(home / "holdout.parquet")

### EDA on the Development Set

We continue our work using only the development dataset.

In supervised learning, the target reigns supreme.  Understanding the target is a primary concern during EDA.

We can look at how our target is distributed using the counts:

In [6]:
data = pd.read_parquet(home / "dev.parquet")
data = data[
    ["headline", "body", "article_url", "date_published", "newspaper", "newspaper_url"]
]
print(data["newspaper"].value_counts())

The Guardian              757
Stuff.co.nz               589
Sky News Australia        543
The New York Times        487
The Daily Mail            477
The BBC                   450
CNN                       430
The Washington Post       414
Al Jazeera                411
Fox News                  399
NewsHub.co.nz             358
The Independent           327
Deutsche Welle            324
The Atlantic              318
The Economist             293
The New Zealand Herald    250
Name: newspaper, dtype: int64


From the results above, we can see that our target is not equally distributed among newspapers.  

This imbalance leads us to use a stratified split during training later on, so that each split keeps roughly the same share of each newspaper.

## Iteration One

### Test Train Split

In [7]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
  data,
  test_size=0.2,
  random_state=42,
  stratify=data["newspaper"]
)
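
To see what `stratify` does, here is a small sketch on made-up data (the class sizes are assumptions chosen to divide evenly). It compares the share of each class in the train and test splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# imbalanced toy target: 60 / 30 / 10 rows per class
toy = pd.DataFrame(
    {
        "body": [f"article {i}" for i in range(100)],
        "newspaper": ["The Guardian"] * 60 + ["CNN"] * 30 + ["The BBC"] * 10,
    }
)
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["newspaper"]
)

# share of each class in both splits
train_share = toy_train["newspaper"].value_counts(normalize=True)
test_share = toy_test["newspaper"].value_counts(normalize=True)
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Both columns show the same proportions as the full toy dataset. Without `stratify`, a small class could end up over- or under-represented in the test split by chance.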

### Feature Engineering

For this first iteration of our pipeline, we will keep things simple at the feature engineering stage.

We transform our target newspaper column using label encoding - making our target column a column of integers:

In [8]:
from sklearn.preprocessing import LabelEncoder

target_encoder = LabelEncoder()
target_tr = target_encoder.fit_transform(train["newspaper"])
train['target'] = target_tr
print(train[["target", "newspaper"]])

      target            newspaper
1089       8              The BBC
1631       1                  CNN
5334       4        NewsHub.co.nz
8245      15  The Washington Post
5067      12      The Independent
...      ...                  ...
3991      11         The Guardian
1503       8              The BBC
3463       3             Fox News
277        0           Al Jazeera
2005       1                  CNN

[5461 rows x 2 columns]
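
The integers follow the sorted order of the class names, which is why `Al Jazeera` maps to 0 above. A minimal sketch on a toy list of names shows the mapping and how to get the names back:

```python
from sklearn.preprocessing import LabelEncoder

names = ["The BBC", "CNN", "Al Jazeera", "CNN"]

enc = LabelEncoder()
codes = enc.fit_transform(names)

# classes_ holds the sorted unique labels; a label's position is its integer code
print(list(enc.classes_))
print(codes.tolist())

# inverse_transform maps integer predictions back to newspaper names
decoded = enc.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around matters later: model predictions are integers, and `inverse_transform` is what turns them back into newspaper names.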


We then create features using tf-idf (term frequency–inverse document frequency):

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2))
features_tr = tfidf.fit_transform(train["body"])
print(features_tr.shape)

(5461, 139311)
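
The `min_df` and `ngram_range` arguments decide which terms become columns. The sketch below uses four made-up sentences and `min_df=2` (instead of the 5 used above, since the toy corpus is tiny) to show which unigrams and bigrams survive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus (made up for illustration)
docs = [
    "climate change is real",
    "climate change policy",
    "climate policy debate",
    "ocean heat record",
]

# keep unigrams and bigrams that appear in at least 2 documents
# and in no more than 90% of documents
vec = TfidfVectorizer(min_df=2, max_df=0.9, ngram_range=(1, 2))
X = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.shape)
```

Only `change`, `climate`, `climate change` and `policy` occur in two or more documents, so the matrix has four columns. The last document shares no terms with the vocabulary and becomes an all-zero row.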


### Transform the Test Set

In [10]:
target_te = target_encoder.transform(test["newspaper"])
features_te = tfidf.transform(test["body"])

### Baseline Model

Now that we have both our target and features, we can train a simple dummy model as a baseline:

In [11]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

mdl = DummyClassifier()
mdl.fit(features_tr, target_tr)
score_tr = mdl.score(features_tr, target_tr)
score_te = mdl.score(features_te, target_te)
print(f"{mdl} scores tr: {score_tr} te: {score_te}")

DummyClassifier() scores tr: 0.11096868705365318 te: 0.11054172767203514


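
The baseline accuracy of about 0.11 is what we would expect from always predicting the largest class: The Guardian accounts for 757 of the 6,827 development rows, which is roughly 0.111. The sketch below illustrates this on a toy target, assuming the `prior` strategy (the default in recent scikit-learn releases), which ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# toy target with class sizes 6 / 3 / 1
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="prior")
baseline.fit(X, y)

# share of the largest class in the target
majority_share = np.bincount(y).max() / len(y)
score = baseline.score(X, y)
print(majority_share, score)
```

Any real model needs to beat this number before its accuracy means anything.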

### Train Random Forest

Our first machine learning model is a random forest classifier:

In [12]:
from sklearn.ensemble import RandomForestClassifier

mdl = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
mdl.fit(features_tr, target_tr)
score_tr = mdl.score(features_tr, target_tr)
score_te = mdl.score(features_te, target_te)
print(f"{mdl} scores tr: {score_tr} te: {score_te}")

RandomForestClassifier(n_jobs=-1, random_state=42) scores tr: 1.0 te: 0.8089311859443631



At this point, an experienced developer will notice that the code for training the baseline dummy classifier and the random forest is almost identical.

This repetition is a good candidate for refactoring into a function in our second iteration.
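
As a rough sketch of what such a function might look like (the name `fit_and_score` and the toy data are hypothetical, not taken from the later iterations), the fit and score steps can be wrapped once and reused for any scikit-learn classifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier


def fit_and_score(mdl, features_tr, target_tr, features_te, target_te):
    """Fit mdl on the training split, return (train accuracy, test accuracy)."""
    mdl.fit(features_tr, target_tr)
    return mdl.score(features_tr, target_tr), mdl.score(features_te, target_te)


# toy numeric data standing in for the tf-idf features
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(40, 3)), rng.normal(size=(10, 3))
y_tr, y_te = (X_tr[:, 0] > 0).astype(int), (X_te[:, 0] > 0).astype(int)

scores = {}
for mdl in [
    DummyClassifier(),
    RandomForestClassifier(n_estimators=10, random_state=42),
]:
    scores[type(mdl).__name__] = fit_and_score(mdl, X_tr, y_tr, X_te, y_te)
    print(f"{mdl} scores tr/te: {scores[type(mdl).__name__]}")
```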

### Should We Look at the Holdout Set Now?

This point is the natural end of our first pipeline - we don't plan on adding anything until we start our second iteration.

We could grab our holdout set and evaluate how well our model generalizes - but it's probably not worth it at this stage.  Each time we evaluate on the holdout and iterate again, we risk starting to overfit to the holdout.

For this reason, we will not evaluate on the holdout here.

## Iteration Two

The results of our first pipeline proved promising - we can show that a naive implementation (minimal feature engineering, one model with no hyperparameter tuning) can learn from our dataset.

Our second implementation will start with a refactor of the first implementation - taking what we have done and combining it with the lessons we learnt during our first implementation.

### Read & Merge Dataset

First we will refactor our code to read the dataset from the SQLite database and merge the tables together:

In [13]:
def create_dataset():
    conn = sqlite3.connect("./data/db.sqlite")
    articles = pd.read_sql_query("SELECT * from article", conn)
    print(f"articles sqlite table: {articles.shape}")
    newspapers = pd.read_sql_query("SELECT * from newspaper", conn)
    data = pd.merge(articles, newspapers, left_on="newspaper_id", right_on="id")
    data = data[
        [
            "headline",
            "body",
            "article_url",
            "date_published",
            "fancy_name",
            "newspaper_url",
            "newspaper_id",
        ]
    ]
    data = data.rename({"fancy_name": "newspaper"}, axis=1)
    return data

data = create_dataset()

articles sqlite table: (8534, 9)


### Create Holdout

We also refactor the code to create a holdout set:

In [14]:
def create_holdout(data, random_state=42):
    return train_test_split(
      data,
      test_size=0.2,
      random_state=random_state,
    )

data = create_dataset()
dev, holdout = create_holdout(data)

articles sqlite table: (8534, 9)


### Cross Validation

In our first implementation, we created a single train/test split.  Cross-validation is a more thorough way to perform model evaluation, at the cost of training time.

Let's implement our model pipeline using the two functions we developed above with cross validation:

In [15]:
from sklearn.model_selection import StratifiedKFold

data = create_dataset()
dev, holdout = create_holdout(data)

#  create 3 folds using stratified k fold
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for tr_idx, te_idx in splitter.split(dev, dev["newspaper_id"]):
    tr = dev.iloc[tr_idx]
    te = dev.iloc[te_idx]
    print(f" train: {tr.shape} test: {te.shape}")

articles sqlite table: (8534, 9)
 train: (4551, 7) test: (2276, 7)
 train: (4551, 7) test: (2276, 7)
 train: (4552, 7) test: (2275, 7)
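
Two properties of these folds are worth checking: every row lands in exactly one test fold, and each test fold keeps the class proportions. A minimal sketch on a made-up target (class sizes chosen to divide evenly by three):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy target with class sizes 9 / 6 / 3
labels = np.array([0] * 9 + [1] * 6 + [2] * 3)
rows = np.arange(len(labels)).reshape(-1, 1)

splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

seen = []  # every row index that appears in some test fold
fold_counts = []  # per-class counts in each test fold
for tr_idx, te_idx in splitter.split(rows, labels):
    seen.extend(te_idx.tolist())
    fold_counts.append(np.bincount(labels[te_idx], minlength=3).tolist())

print(sorted(seen) == list(range(len(labels))))
print(fold_counts)
```

Each test fold holds three, two and one rows of the three classes, mirroring the 9 / 6 / 3 split of the full target.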


Above we only create the folds - let's add in the code to train and score a model on each one:

In [16]:
from sklearn.base import clone
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

data = create_dataset()
dev, holdout = create_holdout(data)
mdl = DummyClassifier()

splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

results = []
for idx, (tr_idx, te_idx) in enumerate(splitter.split(dev, dev["newspaper_id"])):

    mdl = clone(mdl)
    #  copy each slice so adding a column does not raise SettingWithCopyWarning
    tr = dev.iloc[tr_idx].copy()
    te = dev.iloc[te_idx].copy()
    print(f" {idx} train: {tr.shape} test: {te.shape}")

    labels = LabelEncoder()
    tr["target"] = labels.fit_transform(tr["newspaper"])
    tfidf = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2))
    tr_features = tfidf.fit_transform(tr["body"])

    tr_score = accuracy_score(
        tr["target"], mdl.fit(tr_features, tr["target"]).predict(tr_features)
    )

    te["target"] = labels.transform(te["newspaper"])
    te_features = tfidf.transform(te["body"])
    te_score = accuracy_score(te["target"], mdl.predict(te_features))
    print(f" {idx} scores: train: {tr_score} test: {te_score}")
    results.append({"train-score": tr_score, "test-score": te_score})

print(pd.DataFrame(results))

articles sqlite table: (8534, 9)
 0 train: (4551, 7) test: (2276, 7)
 0 scores: train: 0.1109646231597451 test: 0.11072056239015818
 1 train: (4551, 7) test: (2276, 7)
 1 scores: train: 0.1109646231597451 test: 0.11072056239015818
 2 train: (4552, 7) test: (2275, 7)
 2 scores: train: 0.11072056239015818 test: 0.11120879120879121
   train-score  test-score
0     0.110965    0.110721
1     0.110965    0.110721
2     0.110721    0.111209

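
For comparison, scikit-learn can run the same kind of loop for us. The sketch below is an alternative to the manual loop above, not the approach this lesson takes: it puts the tf-idf step and the model in a `Pipeline` and hands it to `cross_validate`, so the vectorizer is refit on the training part of each fold. The corpus and labels are made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# toy corpus: two "newspapers" with different vocabularies
texts = [f"carbon emissions report number {i}" for i in range(6)] + [
    f"storm flood weather update {i}" for i in range(6)
]
labels = ["paper-a"] * 6 + ["paper-b"] * 6

# the vectorizer is fit inside each fold, so no test text leaks into training
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_validate(
    model, texts, labels, cv=cv, scoring="accuracy", return_train_score=True
)
print(cv_scores["train_score"].mean(), cv_scores["test_score"].mean())
```

Writing the loop by hand, as we do in this lesson, makes each step visible; the built-in helper is shorter once the steps are understood.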

## Iteration Three

### Refactor

Let's again take the chance to refactor our pipeline:

In [17]:
from sklearn.base import clone
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def train(mdl, data, encoders):
    #  copy to avoid pandas setting with copy warning
    data = data.copy()
    newspapers = data["newspaper"].copy()
    data.loc[:, "target"] = encoders["target"].fit_transform(newspapers)
    features = encoders["tfidf"].fit_transform(data["body"])
    return accuracy_score(
        data["target"], mdl.fit(features, data["target"]).predict(features)
    )


def test(mdl, data, encoders):
    #  copy to avoid pandas setting with copy warning
    data = data.copy()
    newspapers = data["newspaper"].copy()
    data.loc[:, "target"] = encoders["target"].transform(newspapers)
    features = encoders["tfidf"].transform(data["body"])
    return accuracy_score(data["target"], mdl.predict(features))


def pipeline(mdl, data):
    print(f"\npipeline {mdl}:")

    splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    results = []
    for idx, (tr_idx, te_idx) in enumerate(splitter.split(data, data["newspaper_id"])):
        encoders = {
            "target": LabelEncoder(),
            "tfidf": TfidfVectorizer(
                min_df=5, max_df=0.9, max_features=1000, ngram_range=(1, 2)
            ),
        }
        mdl = clone(mdl)
        tr = data.iloc[tr_idx]
        te = data.iloc[te_idx]
        print(f" train: {tr.shape} test: {te.shape}")

        tr_score = train(mdl, tr, encoders)
        te_score = test(mdl, te, encoders)
        print(f" {idx} scores: train: {tr_score} test: {te_score}")
        results.append({"train-score": tr_score, "test-score": te_score})

    return pd.DataFrame(results).mean()

dev = create_dataset()
dev, holdout = create_holdout(dev)
mdl = DummyClassifier()
results = pipeline(mdl, dev)
print(results)

articles sqlite table: (8534, 9)

pipeline DummyClassifier():
 train: (4551, 7) test: (2276, 7)
 0 scores: train: 0.1109646231597451 test: 0.11072056239015818
 train: (4551, 7) test: (2276, 7)
 1 scores: train: 0.1109646231597451 test: 0.11072056239015818
 train: (4552, 7) test: (2275, 7)
 2 scores: train: 0.11072056239015818 test: 0.11120879120879121
train-score    0.110883
test-score     0.110883
dtype: float64



### Add Hyperparameter Tuning with a Grid Search

Let's pick some hyperparameters to tune and add a grid search to our pipeline:

In [18]:
from sklearn.model_selection import ParameterGrid

params = {
    "min_df": [1, 5, 10],
    "max_df": [0.5, 0.9],
    "ngram_range": [(1, 1), (2, 2)],
    "n_estimators": [100, 500],
    "max_depth": [None, 5],
    "max_features_tfidf": [100, 1000],
}
grid = ParameterGrid(params)
print(len(grid))

96


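
The grid size is the product of the number of values for each hyperparameter: 3 x 2 x 2 x 2 x 2 x 2 = 96. A short sketch confirming the arithmetic and showing what a single grid entry looks like:

```python
import math

from sklearn.model_selection import ParameterGrid

params = {
    "min_df": [1, 5, 10],
    "max_df": [0.5, 0.9],
    "ngram_range": [(1, 1), (2, 2)],
    "n_estimators": [100, 500],
    "max_depth": [None, 5],
    "max_features_tfidf": [100, 1000],
}

# one combination per element of the cartesian product of the value lists
expected = math.prod(len(values) for values in params.values())
grid = ParameterGrid(params)
print(expected, len(grid))

# each entry is a plain dict with one value per hyperparameter
first = grid[0]
print(first)
```

Because the size grows multiplicatively, adding one more hyperparameter with three values would triple the number of pipeline runs.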

Now let's integrate these hyperparameters into our pipeline with a random forest - this requires adding a `hyperparameters` argument to our `pipeline` function.  Note that this version also uses two cross-validation folds rather than three:

In [19]:
def pipeline(mdl, data, hyperparameters):
    print(f"pipeline {mdl}:")

    splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
    results = []
    for idx, (tr_idx, te_idx) in enumerate(splitter.split(data, data["newspaper_id"])):
        encoders = {
            "target": LabelEncoder(),
            "tfidf": TfidfVectorizer(
                min_df=hyperparameters["min_df"],
                max_df=hyperparameters["max_df"],
                max_features=hyperparameters["max_features_tfidf"],
                ngram_range=hyperparameters["ngram_range"],
            ),
        }
        mdl = clone(mdl)
        tr = data.iloc[tr_idx]
        te = data.iloc[te_idx]
        print(f" train: {tr.shape} test: {te.shape}")

        tr_score = train(mdl, tr, encoders)
        te_score = test(mdl, te, encoders)
        print(f" fold {idx} scores: train: {tr_score} test: {te_score}")
        results.append({"train-score": tr_score, "test-score": te_score})
    return pd.DataFrame(results).mean().to_dict()

Now let's run the grid search, saving the results after each set of hyperparameters:

In [None]:
grid_results = []
for idx, params in enumerate(grid):
    mdl = RandomForestClassifier(
        n_estimators=params["n_estimators"], max_depth=params["max_depth"]
    )
    print(f"\ngrid params {idx} of {len(grid)}:")
    result = pipeline(mdl, dev.copy(), params)
    grid_results.append({
      **params,
      **result,
      "run_id": idx
    })
    df = pd.DataFrame(grid_results)
    df.to_parquet(home / "grid.parquet")


grid params 0 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5509666080843585
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.6024025783767946

grid params 1 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)


```
grid params 0 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5439367311072056
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.6032815704658658

grid params 1 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9970700263697627 test: 0.515817223198594
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.997070884592853 test: 0.527395253442719

grid params 2 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5653192735793791
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.6126574860826253

grid params 3 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9970700263697627 test: 0.531634446397188
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.997070884592853 test: 0.5411661295048344

grid params 4 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5456942003514939
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.6006445941986522

grid params 5 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9970700263697627 test: 0.517867603983597
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.997070884592853 test: 0.5265162613536478

grid params 6 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5650263620386643
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.6164664518019338

grid params 7 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9970700263697627 test: 0.5333919156414763
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.997070884592853 test: 0.5402871374157633

grid params 8 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5509666080843585
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.595370641664225

grid params 9 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9970700263697627 test: 0.5175746924428822
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.997070884592853 test: 0.5297392323469089

grid params 10 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.998828010547905 test: 0.5656121851200937
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9997070884592852 test: 0.6088485203633167

grid params 11 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9970700263697627 test: 0.525776215582894
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.997070884592853 test: 0.5391151479636683

grid params 12 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7656707674282367
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7808379724582479

grid params 13 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7536613942589337
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7562261939642543

grid params 14 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7820738137082601
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7896278933489599

grid params 15 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7612770943175161
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7723410489305597

grid params 16 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7721148213239601
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7746850278347495

grid params 17 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7551259519625073
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.760914151772634

grid params 18 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7867603983596954
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7869909170817463

grid params 19 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7650849443468073
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7650161148549663

grid params 20 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7674282366725249
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7699970700263697

grid params 21 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7480960749853545
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7606211544096103

grid params 22 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7888107791446983
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7820099619103428

grid params 23 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7644991212653779
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7656021095810138

grid params 24 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.46367896895137667
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.4825666569000879

grid params 25 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9985350131848814 test: 0.4938488576449912
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9994141769185706 test: 0.5303252270729564

grid params 26 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.4953134153485647
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.5139173747436273

grid params 27 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9985350131848814 test: 0.5146455770357352
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9994141769185706 test: 0.5443891004980955

grid params 28 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.46953719976567077
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.48901259888661003

grid params 29 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9985350131848814 test: 0.4991212653778559
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9994141769185706 test: 0.5285672428948139

grid params 30 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.4956063268892794
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.5095224142982713

grid params 31 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9985350131848814 test: 0.520796719390744
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9994141769185706 test: 0.5399941400527395

grid params 32 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.471294669009959
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.4796366832698506

grid params 33 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9985350131848814 test: 0.5002929115407148
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9994141769185706 test: 0.5282742455317903

grid params 34 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.4944346807264206
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.5183123351889833

grid params 35 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.9985350131848814 test: 0.5219683655536028
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.9994141769185706 test: 0.5470260767653091

grid params 36 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7662565905096661
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7764430120128919

grid params 37 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7480960749853545
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7562261939642543

grid params 38 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.784124194493263
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7773220041019631

grid params 39 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7565905096660809
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.76530911221799

grid params 40 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7568834212067955
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.766481101670085

grid params 41 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7495606326889279
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7503662467037797

grid params 42 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7817809021675454
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7767360093759156

grid params 43 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7580550673696543
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7635511280398476

grid params 44 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7653778558875219
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7644301201289189

grid params 45 of 96:
pipeline RandomForestClassifier():
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7533684827182191
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7453852915323762

grid params 46 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7782659636789689
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.778493993554058

grid params 47 of 96:
pipeline RandomForestClassifier(n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 1.0 test: 0.7589338019917985
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 1.0 test: 0.7615001464986815

grid params 48 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.543803105772048 test: 0.46104276508494435
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5641476274165202 test: 0.5010254907705831

grid params 49 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5092294169352476 test: 0.44141769185705915
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5026362038664324 test: 0.46205684148842663

grid params 50 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5364781716964547 test: 0.4513766842413591
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5767428236672525 test: 0.5086434222092001

grid params 51 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5162613536478172 test: 0.45166959578207383
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5041007615700058 test: 0.46088485203633167

grid params 52 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5367711690594784 test: 0.45547744581136496
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5817223198594025 test: 0.5092294169352476

grid params 53 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5092294169352476 test: 0.4361452841241945
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5117164616285882 test: 0.46352182830354527

grid params 54 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5440961031350717 test: 0.45547744581136496
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5749853544229643 test: 0.5068854380310577

grid params 55 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5159683562847934 test: 0.4507908611599297
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5087873462214412 test: 0.4696747729270437

grid params 56 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5338411954292411 test: 0.45489162272993555
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5758640890451083 test: 0.5016114854966305

grid params 57 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5162613536478172 test: 0.45254833040421794
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.49502050380785 test: 0.45619689422795195

grid params 58 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5476120714913566 test: 0.45694200351493847
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5732278851786761 test: 0.5068854380310577

grid params 59 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5177263404629359 test: 0.4502050380785003
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4997070884592853 test: 0.46000585994726045

grid params 60 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6229123937884559 test: 0.5928529584065613
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.635618043350908 test: 0.6196894227951948

grid params 61 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5039554644008204 test: 0.4967779730521383
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5263620386643234 test: 0.5048344564898916

grid params 62 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6243773806035746 test: 0.6028119507908611
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6244874048037493 test: 0.6138294755347202

grid params 63 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4992675065924407 test: 0.5014645577035736
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5275336848271822 test: 0.5062994433050102

grid params 64 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6191034280691474 test: 0.5987111892208553
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6250732278851787 test: 0.6106065045414591

grid params 65 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4895985936126575 test: 0.4874048037492677
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5087873462214412 test: 0.48842660416056255

grid params 66 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6223263990624084 test: 0.6010544815465729
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6294669009958992 test: 0.6144154702607677

grid params 67 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4939935540580135 test: 0.4947275922671353
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5202108963093146 test: 0.5001464986815118

grid params 68 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6276003515968356 test: 0.6028119507908611
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6303456356180434 test: 0.6158804570758863

grid params 69 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4960445355991796 test: 0.4976567076742824
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5304628002343292 test: 0.51098740111339

grid params 70 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6255493700556695 test: 0.6051552431165788
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6280023432923257 test: 0.6114854966305303

grid params 71 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4939935540580135 test: 0.4953134153485647
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5202108963093146 test: 0.4986815118663932

grid params 72 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.476999707002637 test: 0.40392501464557706
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.46748681898066785 test: 0.3964254321711105

grid params 73 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4881336067975388 test: 0.42618629173989453
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4876977152899824 test: 0.45531790213888074

grid params 74 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.47729270436566074 test: 0.4065612185120094
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4710017574692443 test: 0.3975974216232054

grid params 75 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4887196015235863 test: 0.4273579379027534
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4956063268892794 test: 0.45531790213888074

grid params 76 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.47582771755054204 test: 0.39777387229056826
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.46309314586994726 test: 0.38997949018458833

grid params 77 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.48725461470846765 test: 0.4270650263620387
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4938488576449912 test: 0.45678288895399943

grid params 78 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4761207149135658 test: 0.4091974223784417
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.47656707674282367 test: 0.4028713741576326

grid params 79 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4881336067975388 test: 0.4270650263620387
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4838898652606913 test: 0.45414591268678584

grid params 80 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.46674479929680635 test: 0.3992384299941418
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.46133567662565905 test: 0.3917374743627307

grid params 81 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.4925285672428948 test: 0.4308728763913298
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.4938488576449912 test: 0.45268092587166714

grid params 82 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.47846469381775564 test: 0.4124194493263035
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.47539543057996486 test: 0.40140638734251394

grid params 83 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.48842660416056255 test: 0.4288224956063269
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.484182776801406 test: 0.4532669205977146

grid params 84 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6138294755347202 test: 0.5978324545987111
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6262448740480375 test: 0.6182244359800761

grid params 85 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5517140345736888 test: 0.5442296426479203
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5656121851200937 test: 0.5470260767653091

grid params 86 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6229123937884559 test: 0.6036906854130053
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6236086701816052 test: 0.6111924992675066

grid params 87 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5514210372106652 test: 0.5468658465143527
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5612185120093731 test: 0.5399941400527395

grid params 88 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6220334016993847 test: 0.6031048623315759
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6203866432337434 test: 0.6073835335481981

grid params 89 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5411661295048344 test: 0.5392501464557704
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5852372583479789 test: 0.5719308526223263

grid params 90 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6276003515968356 test: 0.6031048623315759
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.616871704745167 test: 0.6044535599179608

grid params 91 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5487840609434516 test: 0.5448154657293497
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5796719390743995 test: 0.5599179607383533

grid params 92 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.6246703779665983 test: 0.6060339777387229
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6285881663737551 test: 0.6147084676237914

grid params 93 of 96:
pipeline RandomForestClassifier(max_depth=5):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5511280398476414 test: 0.5459871118922085
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5828939660222613 test: 0.5648989159097568

grid params 94 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.624963375329622 test: 0.6013473930872877
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.6183362624487405 test: 0.604160562554937

grid params 95 of 96:
pipeline RandomForestClassifier(max_depth=5, n_estimators=500):
 train: (3413, 7) test: (3414, 7)
 fold 0 scores: train: 0.5432171110460006 test: 0.5424721734036321
 train: (3414, 7) test: (3413, 7)
 fold 1 scores: train: 0.5796719390743995 test: 0.5637269264576619

```

### Select & Train on Best Params 

Now that our grid search is over, we can select the best parameters:

In [None]:
grid = pd.read_parquet(home / "grid.parquet")

# select the row with the highest test score
best_fold = grid.sort_values("test-score", ascending=False).iloc[0]

best_fold = best_fold.to_dict()
print(best_fold)

```
{'max_depth': nan, 'max_df': 0.5, 'max_features_tfidf': 1000, 'min_df': 5, 'n_estimators': 500, 'ngram_range': array([1, 1]), 'train-score': 1.0, 'test-score': 0.7868756577207208, 'run_id': 18}
```
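Picking the single highest-scoring row can favour a parameter set that did well on one fold by chance. A minimal sketch of an alternative is shown below: average the test score over folds for each `run_id` before choosing. It assumes the results frame holds one row per fold and shares a `run_id` across the folds of one parameter combination; the toy frame and its values are invented for illustration and are not taken from `grid.parquet`.

```python
import pandas as pd

# Toy stand-in for the grid results (hypothetical values).
# Assumption: one row per (run_id, fold).
grid = pd.DataFrame(
    {
        "run_id": [0, 0, 1, 1, 2, 2],
        "n_estimators": [100, 100, 500, 500, 100, 100],
        "test-score": [0.76, 0.77, 0.78, 0.78, 0.80, 0.60],
    }
)

# Average the test score over folds for each parameter combination.
mean_scores = grid.groupby("run_id")["test-score"].mean()

# idxmax returns the index label (here the run_id) with the highest mean.
best_run_id = mean_scores.idxmax()

# For comparison: the run_id of the single best row, as selected above.
best_single_row_run_id = grid.sort_values("test-score", ascending=False).iloc[0]["run_id"]

print(f"best by mean score: {best_run_id}")
print(f"best by single row: {int(best_single_row_run_id)}")
```

In this toy data the two selections disagree: run 2 has the best single fold but the worst average, which is the situation averaging is meant to guard against.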

We can now use our `test` and `train` functions again with our final, best hyperparameters:

In [None]:
data = create_dataset()
tr, ho = create_holdout(data)

# pandas stores a missing max_depth as NaN, but scikit-learn expects None or an int
if np.isnan(best_fold["max_depth"]):
    best_fold["max_depth"] = None
else:
    best_fold["max_depth"] = int(best_fold["max_depth"])

print(f"best_params: {best_fold}")
mdl = RandomForestClassifier(
    n_estimators=best_fold["n_estimators"], max_depth=best_fold["max_depth"]
)
encoders = {
    "target": LabelEncoder(),
    "tfidf": TfidfVectorizer(
        min_df=best_fold["min_df"],
        max_df=best_fold["max_df"],
        max_features=best_fold["max_features_tfidf"],
        ngram_range=tuple(best_fold["ngram_range"]),
    ),
}

tr_score = train(mdl, tr, encoders)
te_score = test(mdl, ho, encoders)
print(f" scores train: {tr_score} test: {te_score}")

```
articles sqlite table: (8534, 9)
best_params: {'max_depth': None, 'max_df': 0.5, 'max_features_tfidf': 1000, 'min_df': 5, 'n_estimators': 500, 'ngram_range': array([1, 1]), 'train-score': 1.0, 'test-score': 0.7868756577207208, 'run_id': 18}
 scores train: 1.0 test: 0.8131224370240188
```
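The conversion from a saved grid row to model and vectorizer arguments is needed each time the best parameters are reused, so it could be collected into one place. The sketch below is one way to do that. `to_sklearn_params` is a hypothetical helper, not part of the notebook, and the example row simply mirrors the keys printed above.

```python
import math


def to_sklearn_params(row: dict) -> dict:
    """Hypothetical helper: clean a grid-search row for scikit-learn."""
    max_depth = row["max_depth"]
    # A missing max_depth comes back from pandas as NaN; scikit-learn expects None.
    if max_depth is None or math.isnan(max_depth):
        max_depth = None
    else:
        max_depth = int(max_depth)

    return {
        "n_estimators": int(row["n_estimators"]),
        "max_depth": max_depth,
        "min_df": row["min_df"],
        "max_df": row["max_df"],
        "max_features": int(row["max_features_tfidf"]),
        # The saved row holds ngram_range as an array; convert it to a tuple of ints.
        "ngram_range": tuple(int(n) for n in row["ngram_range"]),
    }


# Example row with the same keys as the best parameters printed above.
row = {
    "max_depth": float("nan"),
    "max_df": 0.5,
    "max_features_tfidf": 1000,
    "min_df": 5,
    "n_estimators": 500,
    "ngram_range": [1, 1],
}
params = to_sklearn_params(row)
print(params)
```

The returned dictionary can then be split between `RandomForestClassifier` (`n_estimators`, `max_depth`) and `TfidfVectorizer` (the remaining keys).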

### Train Final Model

We now have our final estimate of model performance: the model predicted the correct newspaper for about 81% of the articles in the holdout set.
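A single holdout score carries some sampling uncertainty. As a rough sketch, we can put an approximate interval around it using the standard error of a proportion. The holdout size used here is an assumption: it is inferred as the 8534 articles in the table minus the 6827 rows seen in the cross validation folds, and it would differ if `create_dataset` drops any rows.

```python
import math

# Holdout accuracy reported above.
accuracy = 0.8131224370240188

# Assumed holdout size: all articles minus the rows used in cross validation.
n_holdout = 8534 - (3413 + 3414)

# Standard error of a proportion under a binomial model.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_holdout)

# Approximate 95% interval using the normal approximation.
lower = accuracy - 1.96 * std_err
upper = accuracy + 1.96 * std_err

print(f"accuracy: {accuracy:.3f} +/- {1.96 * std_err:.3f}")
print(f"approx. 95% interval: ({lower:.3f}, {upper:.3f})")
```

Under these assumptions the interval spans roughly 0.79 to 0.83, so "around 80%" is a reasonable summary of what to expect on unseen articles.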

Our final model that would go into production should be trained on as much data as possible - the entire dataset:

In [None]:
tr = create_dataset()
mdl = RandomForestClassifier(
    n_estimators=best_fold["n_estimators"], max_depth=best_fold["max_depth"]
)
encoders = {
    "target": LabelEncoder(),
    "tfidf": TfidfVectorizer(
        min_df=best_fold["min_df"],
        max_df=best_fold["max_df"],
        max_features=best_fold["max_features_tfidf"],
        ngram_range=tuple(best_fold["ngram_range"]),
    ),
}
tr_score = train(mdl, tr, encoders)
print(f" scores train: {tr_score}")

```
articles sqlite table: (8534, 9)
 scores train: 1.0
```

We are left with only an estimate of performance on the training set, because our final model is trained on all the data and no unseen articles remain to test on.  We still expect this model to perform with around 81% accuracy, in line with the holdout score from the previous step.