# Improving performance

We'll focus on improvements that can be made in these three areas:

1. Text processing

    - tokenize on punctuation to avoid hyphens, underscores, etc.
    - include unigrams and bi-grams in the model. By doing so we're more likely to capture important information involving multiple tokens, e.g. 'middle school'.

Sklearn facilitates this through parameters we can pass to `CountVectorizer` in the preprocessing steps of our pipeline. We can customize the vectorizer to accept only alphanumeric characters in tokens and to include both 1-grams and 2-grams in the vectorization.

```py
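from sklearn.feature_extraction.text import CountVectorizer

# Alphanumeric tokens, each followed by whitespace
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Count single tokens (1-grams) and adjacent pairs (2-grams)
vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2))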
```
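
To see what these two settings do, the sketch below runs the vectorizer's analyzer on a single made-up string (the sample text is an assumption for illustration). `build_analyzer()` returns the function that `CountVectorizer` uses to lowercase, tokenize and build n-grams. Note that the pattern only matches a token that is followed by whitespace, so the sample string ends with a space; without it the last word would be dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

vec = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2))

# build_analyzer() returns the lowercase + tokenize + n-gram function
analyzer = vec.build_analyzer()

# Made-up sample text; the trailing space lets the pattern match 'teacher'
features = analyzer("Middle School teacher ")
print(features)
# ['middle', 'school', 'teacher', 'middle school', 'school teacher']
```

The 2-gram 'middle school' becomes its own feature alongside the individual tokens.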

2. Statistical methods - Interaction modeling

We can use n-grams to capture sequences of words when they appear in order. This does not work when the terms are not next to each other or appear in a different order, e.g. '3rd grade - budget for English teacher' and 'English teacher for 3rd grade'. Interaction terms let us describe mathematically when terms appear together: the product of two token columns is non-zero only when both tokens occur in the same row. Sklearn implements interaction terms through its `PolynomialFeatures` transformer.

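- `degree` - the maximum number of columns multiplied together in one term (`degree=2` gives pairwise interactions). Higher degrees add many more features and come with a performance hit.
- `interaction_only=True` - tells sklearn not to multiply a column by itself.
- `include_bias=False` - leaves out the constant column of ones.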

Insert the step after the preprocessing steps, but before the classifier steps.

```py
from sklearn.preprocessing import PolynomialFeatures

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
```
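
As a small illustration of what the transformer produces, the sketch below applies it to a hand-made matrix for the two example phrases. The three token columns and their 0/1 values are assumptions chosen for illustration, not output from the real dataset. The column labels are built by hand, relying on sklearn placing the original columns first and the pairwise products after them.

```python
from itertools import combinations

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical token-presence columns for the two example phrases
tokens = ["english", "grade", "budget"]
X = np.array([
    [1, 1, 1],  # '3rd grade - budget for English teacher'
    [1, 1, 0],  # 'English teacher for 3rd grade'
])

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = interaction.fit_transform(X)

# Original columns first, then one column per pair of columns
names = tokens + [a + " x " + b for a, b in combinations(tokens, 2)]

for name, column in zip(names, X_int.T):
    print(name, column)
```

The `english x grade` column is 1 in both rows even though the words appear in a different order in each phrase, which is exactly the information a 2-gram would miss.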

3. Computational efficiency using a hashing function

As the number of features in our model grows, so does the size of the resulting matrix, along with the time and memory needed to process it.

Hashing is a way of increasing memory efficiency, usually with little or no loss of model accuracy. The hashing function takes an input, the token, and outputs a hash value, here an integer that is used as a column index. We explicitly state how many outputs the hashing function can have, e.g. 250. The hashing function will then map each token to one of the 250 values/columns. Some columns will have more than one token mapped to them (a collision). In practice this has little effect on model accuracy, provided the number of columns is not too small.

By explicitly stating how many possible outputs the hashing function may have, we limit the size of the objects that need to be processed. With these limits known, computation can be made more efficient and we can get results faster, even on large datasets.
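
The idea can be sketched with the standard library alone. This is a simplified illustration, not sklearn's implementation: `zlib.crc32` is a stand-in hash function (sklearn uses MurmurHash3), and `hash_tokens` and the token lists are hypothetical names and data made up for this example.

```python
import zlib

N_COLUMNS = 250  # number of outputs we allow the hashing function to have


def hash_tokens(tokens, n_columns=N_COLUMNS):
    """Count tokens into a fixed-length row, one slot per hash value."""
    row = [0] * n_columns
    for token in tokens:
        # Stand-in hash: map the token to an integer, then to a column index
        column = zlib.crc32(token.encode("utf-8")) % n_columns
        row[column] += 1
    return row


short_row = hash_tokens(["english", "teacher"])
long_row = hash_tokens(["token%d" % i for i in range(10000)])

# The row length is fixed, however many distinct tokens we feed in
print(len(short_row), len(long_row))  # 250 250

# Every token is still counted; colliding tokens share a column
print(sum(long_row))  # 10000
```

With 10,000 distinct tokens and only 250 columns, many tokens necessarily share a column, but the row never grows beyond 250 entries and no vocabulary has to be stored.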

Hashing is a form of **dimensionality reduction**: it keeps the array of features as small as possible, which is especially useful when the dataset is large.

Sklearn implements hashing through the `HashingVectorizer`, which we can use instead of the `CountVectorizer`.

```py
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(
    norm=None,
    # Keep counts non-negative; replaces non_negative=True in newer scikit-learn
    alternate_sign=False,
    token_pattern=TOKENS_ALPHANUMERIC,
    ngram_range=(1, 2)
)
```

A `HashingVectorizer` acts just like `CountVectorizer` in that it can accept `token_pattern` and `ngram_range` parameters. The important difference is that it creates hash values from the text, so that we get all the computational advantages of hashing!
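
The number of output columns is set with the `n_features` parameter, which defaults to 2**20. As a minimal sketch, the example below uses 250 columns to echo the figure mentioned earlier; the two sample strings are made up for illustration and end with a space so that the token pattern matches the final word.

```python
from sklearn.feature_extraction.text import HashingVectorizer

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Made-up sample rows
docs = ["middle school teacher salary ", "english teacher for 3rd grade "]

hashing_vec = HashingVectorizer(
    n_features=250,        # fixed number of output columns
    norm=None,             # keep raw counts
    alternate_sign=False,  # keep counts non-negative
    token_pattern=TOKENS_ALPHANUMERIC,
    ngram_range=(1, 2),
)

# No vocabulary is learned, so transform() works without fitting first
hashed = hashing_vec.transform(docs)
print(hashed.shape)  # (2, 250)
```

The output width stays at 250 however many distinct tokens the text contains, which is what keeps memory use predictable.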

```py
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Get text data: text_data
# (combine_text_columns is a helper that is not defined in this section)
text_data = combine_text_columns(X_train)

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Instantiate the HashingVectorizer: hashing_vec
hashing_vec = HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit and transform the Hashing Vectorizer
hashed_text = hashing_vec.fit_transform(text_data)

# Create DataFrame and print the head
hashed_df = pd.DataFrame(hashed_text.data)
print(hashed_df.head())
```

```py
          0
0 -0.160128
1  0.160128
2 -0.480384
3 -0.320256
4  0.160128
```

Further steps to improve the model:

- further NLP techniques, such as stemming and stop-word removal
- try a different model, e.g. `RandomForest`, `KNN`, `Naive Bayes`, etc.
- further numeric preprocessing, e.g. an imputation strategy other than the default of replacing `NaN` with the column mean.
- further optimisation, such as using `GridSearchCV` over pipeline objects.
