# NLP with Bag of Words

<img src="https://aiml.com/wp-content/uploads/2023/02/disadvantage-bow-1024x650.png" height=500 width=500>

Implementing NLP models involves a systematic approach that can be divided into several key steps. Each of these steps plays a crucial role in ensuring the success of the NLP project. Here, we delve into the description of each step, explaining their importance and how they contribute to the overall NLP model development process.

#### 1. Data Collection

**Description:** Data collection is the foundational step where you gather the raw material for your NLP project. The nature of your project determines the type of data you need. This could range from social media posts, news articles, and books, to transcripts of spoken language. The data can be collected through various means, including APIs, web scraping, public datasets, or proprietary sources.

**Importance:**

- Determines the scope and potential of your NLP model.
- Impacts the accuracy and reliability of the model outcomes.
- Ensures diversity and representativeness in the dataset, which helps in building robust models.

#### 2. Data Preprocessing

**Description:** Data preprocessing involves cleaning and preparing your text data for modeling. This step may include several tasks such as tokenization (breaking text into tokens or words), removing stop words (common words that add little value), case normalization (converting all text to the same case), stemming and lemmatization (reducing words to their base or root form), and handling of missing or special characters.

**Importance:**

- Improves model efficiency by eliminating irrelevant information.
- Helps in standardizing the text data, making it more suitable for feature extraction and modeling.
- Reduces computational complexity, leading to faster training times.
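
The preprocessing steps described above can be sketched without any external library. The snippet below is only a minimal illustration, not a substitute for a real NLP toolkit: the regular expression, the tiny `STOP_WORDS` set, and the `simple_preprocess` helper are assumptions made up for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"i", "am", "in", "she", "me", "up", "from", "the"}

def simple_preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    lowered = text.lower()                        # case normalization
    tokens = re.findall(r"[a-z0-9']+", lowered)   # crude tokenization; punctuation is discarded
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(simple_preprocess("I am in Hyderabad. She picked me up from the airport."))
# ['hyderabad', 'picked', 'airport']
```

A regex tokenizer like this is fast but naive; it will mishandle hashtags, mentions, and contractions, which is one reason dedicated tokenizers exist.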

#### 3. Feature Extraction

**Description:** Feature extraction converts text into a numerical format that can be used by machine learning algorithms. Common techniques include Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings (like Word2Vec or GloVe). BoW and TF-IDF focus on the frequency of words, whereas embeddings provide a dense representation capturing semantic relationships between words.

**Importance:**

- Transforms text into a structured form that models can understand and learn from.
- Captures important characteristics of the text, such as context or semantic meaning.
- Determines the dimensionality of the input data, which can affect model performance and complexity.
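
To make the Bag of Words idea concrete, here is a small hand-rolled sketch using only the standard library. The two sample sentences and the `bag_of_words` helper are invented for illustration; a library vectorizer does the same job with more options and a sparse output.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Build a sorted vocabulary from every whitespace-separated token.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Because only counts are kept, word order is lost: "the cat saw the dog" and "the dog saw the cat" produce the same vector.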

#### 4. Model Selection

**Description:** Model selection involves choosing the appropriate algorithm or approach for your NLP task. Options include traditional machine learning models such as Logistic Regression, Naive Bayes, and Random Forests.

**Importance:**

- Influences the accuracy, efficiency, and scalability of your NLP solution.
- Depends on the nature of the task (e.g., classification, translation, generation) and the complexity of the data.
- Affects the interpretability of the model and its outcomes.

#### 5. Model Training

**Description:** Model training is the process of feeding the preprocessed and vectorized text data into the chosen model so that it can learn from the data. This step involves setting hyperparameters, choosing an optimizer, and defining a loss function. The model iteratively adjusts its weights based on the input data and the feedback from the loss function until the loss stops improving or an iteration limit is reached.

**Importance:**

- Directly affects the model's ability to make accurate predictions or generate coherent text.
- Requires careful tuning of hyperparameters to balance between underfitting and overfitting.
- Involves validation techniques such as cross-validation or a hold-out validation split to gauge model performance on unseen data.
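
The idea of iteratively adjusting weights from loss feedback can be sketched with plain batch gradient descent on a binary logistic regression. This is a teaching sketch under stated assumptions, not how any particular library trains its models: the toy data, the vocabulary `["good", "bad"]`, the learning rate, and the `train_logistic` helper are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    losses = []
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        loss = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            loss -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
            error = p - yi  # gradient of the log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        # Step against the average gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
        losses.append(loss / n)
    return w, b, losses

# Toy bag-of-words counts for a hypothetical vocabulary ["good", "bad"].
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]

w, b, losses = train_logistic(X, y)
predictions = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5) for xi in X]
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(predictions)
```

The loss starts at about 0.693 (the log loss of always predicting 0.5) and falls as the weight for "good" grows positive and the weight for "bad" grows negative.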

#### 6. Evaluation and Tuning

**Description:** After training, the model is evaluated using metrics appropriate to the specific NLP task, such as accuracy, precision, recall, F1 score for classification tasks, or BLEU score for translation tasks. Based on these metrics, further tuning of the model's parameters or architecture might be necessary to improve performance.

**Importance:**

- Provides insight into how well the model performs on unseen data.
- Helps identify biases, underfitting, or overfitting within the model.
- Essential for optimizing the model to achieve the best balance between performance and generalization.
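
The classification metrics named above can be computed by hand from two label lists, which helps show what each one measures. The labels below and the `per_class_precision_recall` helper are invented for this sketch.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from raw counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)  # TP + FP
        actual = sum(1 for t in y_true if t == label)     # TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

y_true = ["age", "age", "gender", "gender", "religion", "religion"]
y_pred = ["age", "gender", "gender", "gender", "religion", "age"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_precision_recall(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")
for label, (precision, recall) in scores.items():
    print(f"{label:10s} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, "gender" has perfect recall but lower precision because one "age" example was wrongly predicted as "gender"; a single accuracy figure would not reveal that.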


Throughout these steps, collaboration between domain experts, data scientists, and engineers is crucial for addressing the challenges unique to NLP projects, from understanding the nuances of language to deploying scalable NLP solutions.

#### 1. Data Collection
Data collection is the first step, where you gather the textual data you'll use for your model. This can involve scraping websites, using APIs, or accessing pre-existing datasets.

In [70]:
import opendatasets as od
import pandas as pd

In [71]:
od.download('https://www.kaggle.com/datasets/andrewmvd/cyberbullying-classification')

Skipping, found downloaded files in "./cyberbullying-classification" (use force=True to force download)


In [72]:
df = pd.read_csv('cyberbullying-classification/cyberbullying_tweets.csv')

In [73]:
df

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying
...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity
47688,Turner did not withhold his disappointment. Tu...,ethnicity
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity


In [75]:
df['cyberbullying_type'].value_counts()

religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: cyberbullying_type, dtype: int64
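
As a quick sanity check on the counts above, the sketch below turns them into class shares and a majority-class baseline. The numbers are copied from the `value_counts()` output; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "religion": 7998,
    "age": 7992,
    "gender": 7973,
    "ethnicity": 7961,
    "not_cyberbullying": 7945,
    "other_cyberbullying": 7823,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(counts.values()) / total

for label, share in shares.items():
    print(f"{label:20s} {share:.1%}")
print(f"Total tweets: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Every class holds roughly one sixth of the data, so the dataset is close to balanced and a model must clearly beat about 0.17 accuracy to be learning anything.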

#### 2. Data Preprocessing

Preprocessing involves cleaning and preparing the text data for modeling. This may include tokenization, removing stop words, stemming, and lemmatization.

**Example Code:** Tokenization, removing stop words & stemming using NLTK.

In [36]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

# Example text
text = "I am in Hyderabad. She picked me up from the airport."
tokens = word_tokenize(text)
tokens

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/samanvitha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samanvitha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['I',
 'am',
 'in',
 'Hyderabad',
 '.',
 'She',
 'picked',
 'me',
 'up',
 'from',
 'the',
 'airport',
 '.']

In [76]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [None]:
# Rule of thumb: if the normalized (lowercased) word is NOT in stop_words, keep it as a potentially meaningful word.

In [77]:
text

'I am in Hyderabad. She picked me up from the airport.'

In [82]:
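# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_tokens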

['Hyderabad', '.', 'picked', 'airport', '.']

In [83]:
from nltk.stem.snowball import SnowballStemmer

In [84]:
# Create the stemmer once and reuse it for every token
stemmer = SnowballStemmer(language='english')
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]

In [85]:
stemmed_words

['hyderabad', '.', 'pick', 'airport', '.']

**Explanation:**
In this example, we first tokenize a sample sentence into words using NLTK's word_tokenize function. Next, we filter out stop words (common words such as "am", "in", and "the", which carry little meaning on their own) and stem the remaining tokens, which also lowercases them ("picked" becomes "pick"). Punctuation tokens such as "." are not stop words, so they survive this step. Preprocessing of this kind shrinks the vocabulary and keeps the focus on the words that carry more meaning for analysis or modeling tasks.

#### 3. Feature Extraction
Feature extraction involves converting text data into numerical formats that machine learning models can work with, such as using Bag of Words, TF-IDF, or word embeddings.

**Example Code:** Creating a CountVectorizer with scikit-learn.

In [86]:
# Tokenization function
def tokenize(text):
    tokens = word_tokenize(text)
    return tokens

# Stopwords removal
stop_words = set(stopwords.words('english'))

# Stemming function
stemmer = SnowballStemmer(language='english')
def stem_tokens(tokens, stemmer):
    stemmed = [stemmer.stem(token) for token in tokens]
    return stemmed

# Preprocessing function
def preprocess_text(text):
    tokens = tokenize(text)
    filtered_tokens = [word.lower() for word in tokens if word.lower() not in stop_words]
    stemmed_tokens = stem_tokens(filtered_tokens, stemmer)
    return ' '.join(stemmed_tokens)

# Apply preprocessing to the tweet_text column
df['preprocessed_text'] = df['tweet_text'].apply(preprocess_text)

In [88]:
df

Unnamed: 0,tweet_text,cyberbullying_type,preprocessed_text
0,"In other words #katandandre, your food was cra...",not_cyberbullying,"word # katandandr , food crapilici ! # mkr"
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying,# aussietv white ? # mkr # theblock # imaceleb...
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying,@ xochitlsuckkk classi whore ? red velvet cupc...
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying,"@ jason_gio meh . : p thank head , concern ano..."
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying,@ rudhoeenglish isi account pretend kurdish ac...
...,...,...,...
47687,"Black ppl aren't expected to do anything, depe...",ethnicity,"black ppl n't expect anyth , depend anyth . ye..."
47688,Turner did not withhold his disappointment. Tu...,ethnicity,turner withhold disappoint . turner call court...
47689,I swear to God. This dumb nigger bitch. I have...,ethnicity,swear god . dumb nigger bitch . got bleach hai...
47690,Yea fuck you RT @therealexel: IF YOURE A NIGGE...,ethnicity,yea fuck rt @ therealexel : your nigger fuck u...


In [87]:
df['preprocessed_text']

0               word # katandandr , food crapilici ! # mkr
1        # aussietv white ? # mkr # theblock # imaceleb...
2        @ xochitlsuckkk classi whore ? red velvet cupc...
3        @ jason_gio meh . : p thank head , concern ano...
4        @ rudhoeenglish isi account pretend kurdish ac...
                               ...                        
47687    black ppl n't expect anyth , depend anyth . ye...
47688    turner withhold disappoint . turner call court...
47689    swear god . dumb nigger bitch . got bleach hai...
47690    yea fuck rt @ therealexel : your nigger fuck u...
47691    bro . u got ta chill rt @ chillshrammi : dog f...
Name: preprocessed_text, Length: 47692, dtype: object

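**Worked example:** how `preprocess_text` transforms a single tweet.

- Raw text: `This tweet is useless. I am going to delete it.`
- After tokenization: `['This', 'tweet', 'is', 'useless', '.', 'I', 'am', 'going', 'to', 'delete', 'it', '.']`
- After lowercasing and stop-word removal: `['tweet', 'useless', '.', 'going', 'delete', '.']`
- After stemming and joining: `'tweet useless . go delet .'`
- Words that remain once punctuation is ignored (`CountVectorizer`'s default tokenizer drops punctuation): `tweet useless go delet`
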

In [64]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize a count Vectorizer
vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(df['preprocessed_text'])

# Output the count matrix. Note: toarray() converts the sparse matrix to a dense one,
# which needs a lot of memory for a corpus of this size.
print("Count Matrix:\n", count_matrix.toarray())

Count Matrix:
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [65]:
count_matrix.shape

(47692, 50935)
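
The shape above also explains why the count matrix is stored in sparse form. The arithmetic below estimates the size of the dense equivalent; the 8 bytes per value is an assumption (64-bit integer counts), so treat the result as an order-of-magnitude figure.

```python
n_rows, n_cols = 47692, 50935  # shape reported above
bytes_per_value = 8            # assumption: counts stored as 64-bit integers

n_cells = n_rows * n_cols
dense_bytes = n_cells * bytes_per_value

print(f"Cells in the matrix: {n_cells:,}")
print(f"Dense size: {dense_bytes / 1e9:.1f} GB")
```

Under that assumption the dense array would need roughly 19 GB, while a sparse matrix only stores the non-zero counts, and a tweet contains only a handful of distinct words.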

#### 4. Model Building

Model building involves selecting a model (like Logistic Regression, Naive Bayes, etc.) and using the preprocessed and vectorized data to train this model.

**Example Code:** Training a Logistic Regression with scikit-learn.

In [91]:
df['cyberbullying_type']

0        not_cyberbullying
1        not_cyberbullying
2        not_cyberbullying
3        not_cyberbullying
4        not_cyberbullying
               ...        
47687            ethnicity
47688            ethnicity
47689            ethnicity
47690            ethnicity
47691            ethnicity
Name: cyberbullying_type, Length: 47692, dtype: object

In [93]:
count_matrix.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

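In [None]:
# General pattern: hold out a test set first, then split a validation set off the remainder.
# raw_df, inputs, output, and model are placeholders for your own DataFrame,
# feature columns, target column, and estimator.
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(raw_df, test_size=0.2, random_state=45)

In [None]:
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=65)

In [None]:
train_inputs = train_df[inputs]
train_target = train_df[output]

In [None]:
model.fit(train_inputs, train_target)

In [None]:
# The same pattern applied to this notebook's count matrix and labels
labels = df['cyberbullying_type']
train_input, test_input, train_target, test_target = train_test_split(count_matrix, labels, test_size=0.2, random_state=24)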

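**Intuition:** Suppose a model has to learn the rule `y = m * a` (force from mass, with `a = 9.8`).

- Training: the model sees `m = 60` together with the answer `y = 60 * 9.8`.
- Validation: for a mass of 78, does it predict `78 * 9.8` (it learned the rule) or `60 * 9.8` (it memorized the training example)?
- Test: new masses such as 90, 65, and 74 should give `90 * 9.8`, `65 * 9.8`, `74 * 9.8`, and so on.

The same idea applies to the splits of `raw_df`:

- `train_df`: the data on which the model trains.
- `val_df`: data held out to check how well the model generalizes while it is being tuned.
- `test_df`: data used only at the end, for the final estimate of performance on unseen data.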

In [97]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

labels = df['cyberbullying_type']

# Hold out 30% of the data as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(count_matrix, labels, test_size=0.3, random_state=42)

# Split the remaining 70% into training (75%) and validation (25%) sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=54)

# Initialize and train a Logistic Regression
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict on the validation set and calculate accuracy
val_preds = clf.predict(X_val)
val_accuracy = accuracy_score(y_val, val_preds)
print(f"Validation Accuracy: {val_accuracy}")

# Predict on the test set and calculate accuracy
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

Validation Accuracy: 0.8274622573687994
Accuracy: 0.8261112664243779
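
The two nested splits above do not produce the proportions one might read off at a glance, so it is worth working out the sizes. The sketch below assumes the test share is rounded up to a whole number of rows, which agrees with the total support of 14308 shown in the classification report later in this section.

```python
import math

n_samples = 47692  # number of rows in the dataset

# First split: test_size=0.3 (assumed to be rounded up to whole rows)
n_test = math.ceil(0.3 * n_samples)
n_train_val = n_samples - n_test

# Second split: test_size=0.25 of what remains becomes the validation set
n_val = math.ceil(0.25 * n_train_val)
n_train = n_train_val - n_val

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name:10s} {n:6d} rows ({n / n_samples:.1%})")
```

The result is roughly a 52.5% / 17.5% / 30% split, because the 25% validation share is taken from the 70% that remains after the test set is removed.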


In [None]:
model = LogisticRegression()

In [None]:
model.fit(train_input, train_target)

In [None]:
preds = model.predict(test_input)

In [None]:
accuracy_score(test_target,preds)

**Explanation:**
This snippet splits the dataset into training, validation, and test sets, then initializes and trains a Logistic Regression model on the training data. After training, the model predicts labels for the validation and test sets, and the accuracy of these predictions is calculated against the true labels (about 0.83 in both cases). Logistic Regression is a simple and effective baseline for text classification tasks.

#### 5. Evaluation
Finally, the model's performance is evaluated using metrics suited to the task. For a classification task like this one, common choices are accuracy, precision, recall, and F1 score.

**Example Code:** Continued from model building, adding precision and recall.

In [67]:
from sklearn.metrics import accuracy_score, classification_report

In [68]:
from sklearn.metrics import precision_score, recall_score

# Calculate per-class precision and recall
# (average=None returns one score per class, in sorted label order)
precision = precision_score(y_test, predictions, average=None)
recall = recall_score(y_test, predictions, average=None)

print(f"Precision: {precision}")
print(f"Recall: {recall}")

Precision: [0.97079983 0.98786611 0.89968369 0.58445946 0.58227375 0.95869565]
Recall: [0.97658578 0.9764268  0.84976526 0.56123277 0.65119197 0.94190517]


In [69]:
print("\nClassification Report:\n", classification_report(y_test, predictions))


Classification Report:
                      precision    recall  f1-score   support

                age       0.97      0.98      0.97      2349
          ethnicity       0.99      0.98      0.98      2418
             gender       0.90      0.85      0.87      2343
  not_cyberbullying       0.58      0.56      0.57      2466
other_cyberbullying       0.58      0.65      0.61      2391
           religion       0.96      0.94      0.95      2341

           accuracy                           0.82     14308
          macro avg       0.83      0.83      0.83     14308
       weighted avg       0.83      0.82      0.83     14308



**Explanation:**
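Following the model prediction, we calculate precision (the ratio of correctly predicted observations of a class to all observations predicted as that class) and recall (the ratio of correctly predicted observations of a class to all observations that actually belong to that class). These metrics give a more nuanced view of the model's performance than accuracy alone. Here, the overall accuracy of 0.82 hides a large gap between classes: age, ethnicity, and religion score above 0.94 on both metrics, while not_cyberbullying and other_cyberbullying stay at roughly 0.56 to 0.65. Per-class metrics matter even more in imbalanced datasets, where classes are not represented equally.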
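
The F1 scores and macro averages in the classification report can be reproduced from the per-class precision and recall values printed earlier. The sketch below copies those numbers from the output above; the class order is assumed to be the alphabetical order used in the report.

```python
# Per-class values copied from the precision/recall output above,
# assumed to be in the same alphabetical order as the report.
labels = ["age", "ethnicity", "gender", "not_cyberbullying", "other_cyberbullying", "religion"]
precision = [0.97079983, 0.98786611, 0.89968369, 0.58445946, 0.58227375, 0.95869565]
recall = [0.97658578, 0.9764268, 0.84976526, 0.56123277, 0.65119197, 0.94190517]

# F1 is the harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Macro averages weight every class equally.
macro_precision = sum(precision) / len(precision)
macro_recall = sum(recall) / len(recall)

for label, score in zip(labels, f1):
    print(f"{label:20s} f1={score:.2f}")
print(f"macro precision={macro_precision:.2f}, macro recall={macro_recall:.2f}")
```

The rounded values match the report: F1 of 0.57 and 0.61 for the two weakest classes, and macro averages of 0.83 for both precision and recall.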