# Text Classification with Bag of Words - Natural Language Processing

![](https://i.imgur.com/hlEQ5X8.png)

> _"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."_ - Wikipedia

> _**Bag of Words**: The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears._

Outline:

1. Download and explore a real-world dataset
2. Apply text preprocessing techniques
3. Implement the bag of words model
4. Train ML models for text classification
5. Make predictions and submit to Kaggle


Dataset: https://www.kaggle.com/c/quora-insincere-questions-classification


## Download and Explore the Data

Outline:

1. Download the dataset from Kaggle to Colab
2. Explore the data using Pandas
3. Create a small working sample

### Download the Data to Colab

Upload your `kaggle.json` to Colab. Get it here: https://www.kaggle.com/docs/api#authentication
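
Before calling the Kaggle CLI, it can help to confirm that the credentials file is actually present and readable only by you. The following is a minimal sketch; the helper name `secure_kaggle_credentials` is hypothetical and not part of the Kaggle API, and the path is the one used in this notebook.

```python
import os
from pathlib import Path


def secure_kaggle_credentials(path):
    """Restrict the credentials file to owner read/write; return False if it is missing."""
    cred = Path(path)
    if not cred.is_file():
        return False
    # 0o600 is octal: read/write for the owner, no access for anyone else
    os.chmod(cred, 0o600)
    return True


# Path assumed from the cells in this notebook
if not secure_kaggle_credentials("/content/kaggle.json"):
    print("kaggle.json not found; upload it before running the download cells")
```

Checking for the file first avoids the `FileNotFoundError` and the later authentication failure from the CLI.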


In [None]:
!ls .

sample_data


In [None]:
import os

In [None]:
# Restrict the credentials file to owner read/write (octal 0o600, not decimal 600)
os.chmod("/content/kaggle.json", 0o600)

FileNotFoundError: ignored

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = '.'

In [None]:
!kaggle competitions download -c quora-insincere-questions-classification -f train.csv -p data

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.10/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.10/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in .. Or use the environment method.


In [None]:
# Kaggle notebooks set this environment variable; it is normally absent elsewhere
IS_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ

In [None]:
if IS_KAGGLE:
    data_dir = '../input/quora-insincere-questions-classification'
    train_fname = data_dir + '/train.csv'
    test_fname = data_dir + '/test.csv'
    sample_fname = data_dir + '/sample_submission.csv'
else:
    os.environ['KAGGLE_CONFIG_DIR'] = '.'
    !kaggle competitions download -c quora-insincere-questions-classification -f train.csv -p data
    !kaggle competitions download -c quora-insincere-questions-classification -f test.csv -p data
    !kaggle competitions download -c quora-insincere-questions-classification -f sample_submission.csv -p data
    train_fname = 'data/train.csv.zip'
    test_fname = 'data/test.csv.zip'
    sample_fname = 'data/sample_submission.csv.zip'

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.10/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.10/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in .. Or use the environment method.

### Explore the Data using Pandas

In [None]:
import pandas as pd

In [None]:
raw_df = pd.read_csv(train_fname)

In [None]:
raw_df

In [None]:
sincere_df = raw_df[raw_df.target == 0]

In [None]:
sincere_df.question_text.values[:10]

In [None]:
insincere_df = raw_df[raw_df.target == 1]

In [None]:
insincere_df.question_text.values[:10]

In [None]:
raw_df.target.value_counts(normalize=True)

In [None]:
raw_df.target.value_counts(normalize=True).plot(kind='bar')

In [None]:
test_df = pd.read_csv(test_fname)

In [None]:
test_df

In [None]:
sub_df = pd.read_csv(sample_fname)

In [None]:
sub_df

In [None]:
sub_df.prediction.value_counts()

### Create a Working Sample

In [None]:
if IS_KAGGLE:
    SAMPLE_SIZE = len(raw_df)
else:
    SAMPLE_SIZE = 100_000

In [None]:
sample_df = raw_df.sample(SAMPLE_SIZE, random_state=42)

In [None]:
sample_df

## Text Preprocessing Techniques

Outline:

1. Understand the bag of words model
2. Tokenization
3. Stop word removal
4. Stemming

### Bag of Words Intuition

1. Create a vocabulary: a list of all the distinct words across all the text documents
2. Convert each document into a vector that counts how many times each vocabulary word appears in it
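
The two steps above can be sketched in plain Python. This is only an illustration on two made-up sentences, splitting on whitespace; the names `docs`, `vocab`, and `to_vector` are hypothetical and not part of the notebook.

```python
from collections import Counter

# Two made-up documents, used only for illustration
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: the vocabulary is the sorted set of distinct words across all documents
vocab = sorted({word for doc in docs for word in doc.split()})


def to_vector(doc, vocab):
    """Step 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]  # Counter returns 0 for missing words


vectors = [to_vector(doc, vocab) for doc in docs]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Every vector has the same length as the vocabulary, regardless of how long the original document is.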


Limitations:
1. There may be too many words in the dataset
2. Some words may occur too frequently
3. Some words may occur very rarely or only once
4. A single word may have many forms (go, gone, going or bird vs. birds)
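
One common way to deal with limitations 1 and 3 is to keep only the most frequent words and drop the rest, which is the idea behind the `max_features` option of `CountVectorizer` used later in this notebook. A rough sketch on made-up sentences (all names here are hypothetical):

```python
from collections import Counter

# Made-up documents, used only for illustration
docs = ["the cat sat on the mat", "the dog sat", "a bird sang"]

# Count every word across the whole corpus
word_counts = Counter(word for doc in docs for word in doc.split())

# Words seen only once each add a dimension to every vector
rare_words = sorted(w for w, c in word_counts.items() if c == 1)

# Cap the vocabulary at the MAX_FEATURES most frequent words
MAX_FEATURES = 2
vocab = [word for word, _ in word_counts.most_common(MAX_FEATURES)]

print(vocab)            # ['the', 'sat']
print(len(rare_words))  # 7
```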

In [None]:
q0 = sincere_df.question_text.values[1]

In [None]:
q0

In [None]:
q1 = raw_df[raw_df.target == 1].question_text.values[0]

In [None]:
q1

### Tokenization

Splitting a document into individual words and punctuation marks (tokens).
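
As a rough illustration of the idea, a regular expression can separate words from punctuation. This is a simplified stand-in, not how NLTK's `word_tokenize` works internally, and it will treat contractions and abbreviations differently; `simple_tokenize` is a hypothetical name.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("this is (something) with, a lot of, punctuation;"))
# ['this', 'is', '(', 'something', ')', 'with', ',', 'a', 'lot', 'of', ',', 'punctuation', ';']
```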

In [None]:
import nltk

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# Tokenizer models used by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

In [None]:
q0

In [None]:
word_tokenize(q0)

In [None]:
word_tokenize(' this is (something) with, a lot of, punctuation;')

In [None]:
q1

In [None]:
word_tokenize(q1)

In [None]:
q0_tok = word_tokenize(q0)
q1_tok = word_tokenize(q1)

### Stop Word Removal

Removing commonly occurring words that carry little meaning on their own (e.g. "the", "is", "of").
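
A minimal sketch of the filtering step, using a tiny hand-picked stop word set instead of NLTK's full English list (the set `STOPWORDS` and the function `drop_stopwords` are hypothetical):

```python
# A tiny illustrative stop word set; a set gives fast membership checks
STOPWORDS = {"the", "is", "a", "of", "in", "what"}


def drop_stopwords(tokens, stopwords=STOPWORDS):
    """Keep tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["What", "is", "the", "capital", "of", "France", "?"]
filtered = drop_stopwords(tokens)
print(filtered)  # ['capital', 'France', '?']
```

Note that punctuation survives this step, since it is not in the stop word set.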

In [None]:
q1_tok

In [None]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

In [None]:
english_stopwords = stopwords.words('english')

In [None]:
", ".join(english_stopwords)

In [None]:
def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in english_stopwords]

In [None]:
q0_tok

In [None]:
q0_stp = remove_stopwords(q0_tok)

In [None]:
q0_stp

In [None]:
q1_stp = remove_stopwords(q1_tok)

In [None]:
q1_tok

In [None]:
q1_stp

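### Stemming

Reducing words to a common root form by stripping suffixes:

- "going" -> "go"
- "birds", "bird" -> "bird"

Irregular forms such as "gone" are usually left unchanged by a suffix-stripping stemmer; mapping "gone" to "go" is a job for lemmatization.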

In [None]:
from nltk.stem.snowball import SnowballStemmer

In [None]:
stemmer = SnowballStemmer(language='english')

In [None]:
stemmer.stem('going')

In [None]:
stemmer.stem('supposedly')

In [None]:
q0_stm = [stemmer.stem(word) for word in q0_stp]

In [None]:
q0_stp

In [None]:
q0_stm

In [None]:
q1_stm = [stemmer.stem(word) for word in q1_stp]

In [None]:
q1_stp

In [None]:
q1_stm
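
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer. It is a toy under stated assumptions (a hypothetical three-suffix list), not the Snowball algorithm used above, which applies many more rules.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ["going", "birds", "bird", "gone"]])
# ['go', 'bird', 'bird', 'gone']
```

The irregular form "gone" passes through unchanged, which is the kind of case that stemming cannot resolve by rules on suffixes alone.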

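### Lemmatization

Reducing words to their dictionary form (lemma) using a vocabulary rather than suffix stripping:

- "love" -> "love"
- "loving" -> "love"
- "lovable" -> "love" (depends on the lemmatizer; many keep derived adjectives such as "lovable" unchanged)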

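A minimal sketch of lemmatization as a dictionary lookup, which also covers irregular forms that suffix stripping misses. The table `LEMMAS` and the function `lemmatize` are hypothetical; real lemmatizers such as NLTK's `WordNetLemmatizer` rely on a full lexical database rather than a hand-written table.

```python
# Tiny hand-written lookup table, for illustration only
LEMMAS = {"going": "go", "gone": "go", "went": "go", "loving": "love", "birds": "bird"}


def lemmatize(word, lemmas=LEMMAS):
    """Return the dictionary form if known, otherwise the lowercased word."""
    word = word.lower()
    return lemmas.get(word, word)


print([lemmatize(w) for w in ["Went", "gone", "loving", "love"]])
# ['go', 'go', 'love', 'love']
```
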
## Implement Bag of Words


Outline:

1. Create a vocabulary using Count Vectorizer
2. Transform text to vectors using Count Vectorizer
3. Configure text preprocessing in Count Vectorizer

### Create a Vocabulary

In [None]:
sample_df

In [None]:
small_df = sample_df[:5]

In [None]:
small_df

In [None]:
small_df.question_text.values

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
small_vect = CountVectorizer()

In [None]:
small_vect.fit(small_df.question_text)

In [None]:
small_vect.get_feature_names_out()

### Transform Documents into Vectors

In [None]:
vectors = small_vect.transform(small_df.question_text)

In [None]:
vectors

In [None]:
vectors.shape

In [None]:
small_df.question_text.values[0]

In [None]:
vectors[0].toarray()

In [None]:
vectors.toarray()
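
`transform` returns a SciPy sparse matrix, which stores only the non-zero counts; `toarray()` expands it into a dense array that is mostly zeros. The sketch below imitates that idea in plain Python with a made-up row; `to_sparse` and `to_dense` are hypothetical helpers, not SciPy functions.

```python
def to_sparse(row):
    """Keep only {column_index: count} pairs for non-zero counts."""
    return {i: count for i, count in enumerate(row) if count != 0}


def to_dense(sparse_row, width):
    """Rebuild the full row, filling missing columns with zeros."""
    return [sparse_row.get(i, 0) for i in range(width)]


dense_row = [0, 0, 2, 0, 0, 0, 1, 0]  # made-up word counts
sparse_row = to_sparse(dense_row)

print(sparse_row)                            # {2: 2, 6: 1}
print(to_dense(sparse_row, len(dense_row)))  # [0, 0, 2, 0, 0, 0, 1, 0]
```

With a large vocabulary and short documents, almost every entry is zero, so the sparse form saves a lot of memory.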

### Configure Count Vectorizer Parameters

In [None]:
stemmer = SnowballStemmer(language='english')

In [None]:
def tokenize(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

In [None]:
tokenize('What is the really (dealing) here?')

In [None]:
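# Stop words are matched against the stemmed tokens produced by `tokenize`,
# so scikit-learn may warn that some stemmed stop words are not in the list.
vectorizer = CountVectorizer(lowercase=True,
                             tokenizer=tokenize,
                             stop_words=english_stopwords,
                             max_features=1000)  # keep only the 1000 most frequent tokens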

In [None]:
%%time
vectorizer.fit(sample_df.question_text)

In [None]:
len(vectorizer.vocabulary_)

In [None]:
vectorizer.get_feature_names_out()[:100]

In [None]:
%%time
inputs = vectorizer.transform(sample_df.question_text)

In [None]:
inputs.shape

In [None]:
inputs

In [None]:
sample_df.question_text.values[0]

In [None]:
test_df

In [None]:
%%time
test_inputs = vectorizer.transform(test_df.question_text)

## ML Models for Text Classification

Outline:

- Create a training & validation set
- Train a logistic regression model
- Make predictions on training, validation & test data

### Split into Training and Validation Set

In [None]:
sample_df

In [None]:
inputs.shape

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Hold out 30% of the sample for validation; random_state makes the split reproducible
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs, sample_df.target, test_size=0.3, random_state=42)

In [None]:
train_inputs.shape

In [None]:
train_targets.shape

In [None]:
val_inputs.shape

In [None]:
val_targets.shape
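
Conceptually, `train_test_split` shuffles the rows and cuts them into two parts. A simplified sketch of that idea for a plain list follows. It ignores options such as stratification, and `simple_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random


def simple_split(items, test_size=0.3, seed=42):
    """Shuffle indices with a fixed seed, then cut off a validation portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(items) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [items[i] for i in train_idx], [items[i] for i in val_idx]


train, val = simple_split(list(range(10)))
print(len(train), len(val))  # 7 3
```

Using a fixed seed means the same rows land in the validation set on every run, which keeps scores comparable between experiments.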

### Train a Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
MAX_ITER = 1000

In [None]:
model = LogisticRegression(max_iter=MAX_ITER, solver='sag')

In [None]:
%%time
model.fit(train_inputs, train_targets)
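
Once trained, a logistic regression model scores a document by taking a weighted sum of its word counts, passing it through the sigmoid function, and predicting class 1 when the resulting probability exceeds 0.5. The weights, bias, and counts below are invented for illustration and are not taken from the fitted `model`.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def predict_proba(counts, weights, bias):
    """Probability of class 1 for one bag-of-words vector."""
    z = sum(w * x for w, x in zip(weights, counts)) + bias
    return sigmoid(z)


weights = [1.5, -0.5, 2.0]  # hypothetical per-word weights
bias = -2.0                 # hypothetical intercept
counts = [1, 0, 1]          # hypothetical bag-of-words vector

prob = predict_proba(counts, weights, bias)
label = int(prob > 0.5)
print(round(prob, 3), label)  # 0.818 1
```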

### Make Predictions using the Model

In [None]:
train_preds = model.predict(train_inputs)

In [None]:
train_targets

In [None]:
train_preds

In [None]:
pd.Series(train_preds).value_counts()

In [None]:
pd.Series(train_targets).value_counts()

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(train_targets, train_preds)

In [None]:
import numpy as np

In [None]:
# Baseline: predict "sincere" (0) for every question
accuracy_score(train_targets, np.zeros(len(train_targets)))

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1_score(train_targets, train_preds)

In [None]:
f1_score(train_targets, np.zeros(len(train_targets)))

In [None]:
random_preds = np.random.choice((0, 1), len(train_targets))
f1_score(train_targets, random_preds)
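
The comparison above shows why accuracy alone is misleading on imbalanced data: a model that always predicts the majority class scores high accuracy but an F1 score of zero. A self-contained sketch with a made-up 94/6 class split follows. The ratio is an assumption for illustration, not the dataset's measured ratio, and `accuracy` and `f1` are hypothetical helpers that mirror the usual definitions.

```python
def accuracy(targets, preds):
    """Fraction of predictions that match the targets."""
    return sum(t == p for t, p in zip(targets, preds)) / len(targets)


def f1(targets, preds):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(targets, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(targets, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(targets, preds))
    if tp == 0:
        return 0.0  # no true positives, so the score is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


targets = [0] * 94 + [1] * 6  # made-up imbalanced labels
always_zero = [0] * 100       # baseline that never predicts the positive class

print(accuracy(targets, always_zero))  # 0.94
print(f1(targets, always_zero))        # 0.0
```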

In [None]:
val_preds = model.predict(val_inputs)

In [None]:
accuracy_score(val_targets, val_preds)

In [None]:
f1_score(val_targets, val_preds)

In [None]:
sincere_df.question_text.values[:10]

In [None]:
sincere_df.target.values[:10]

In [None]:
model.predict(vectorizer.transform(sincere_df.question_text.values[:10]))

In [None]:
insincere_df.question_text.values[:10]

In [None]:
insincere_df.target.values[:10]

In [None]:
model.predict(vectorizer.transform(insincere_df.question_text.values[:10]))

## Make Predictions and Submit to Kaggle

In [None]:
test_df

In [None]:
test_inputs.shape

In [None]:
test_preds = model.predict(test_inputs)

In [None]:
sub_df

In [None]:
sub_df.prediction = test_preds

In [None]:
sub_df.prediction.value_counts()

In [None]:
sub_df

In [None]:
sub_df.to_csv('submission.csv', index=False)

In [None]:
!head submission.csv