# Basic NLP Course

## Importance of Removing Stop Words

In Natural Language Processing (NLP), **stop words** are very frequent words that carry little information on their own for tasks such as text analysis or classification. English examples include "the," "is," "in," and "and."

### Concept of Stop Words

- **Definition**: Stop words are words that are often filtered out during preprocessing because they contribute little to the meaning or context of the text. There is no single canonical list; each library ships its own, and the list can be adapted to the task.
- **Examples**: "a," "an," "the," "of," "on," "and," "it."

### Benefits of Removing Stop Words
1. **Reduces Noise**: Stop words appear in almost every document, so their counts say little about what distinguishes one document from another and can drown out more informative terms.
2. **Improves Efficiency**: Removing stop words reduces the number of tokens to process and, for count-based representations, the number of features, which lowers memory use and processing time.
3. **Can Improve Model Performance**: For count-based models used in classification, clustering, or sentiment analysis, dropping uninformative words often helps the model focus on the words that carry more semantic weight. The gain is not guaranteed: words such as "not" appear on many stop lists, and removing them can change the meaning of a sentence.
4. **Simplifies Vocabulary**: Removing stop words shrinks the vocabulary, making Bag of Words and similar representations more concise and manageable.
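
The effect on vocabulary size and token count can be sketched with the standard library alone. The stop list and the three sentences below are made up for illustration; they are not spaCy's list and do not come from the course dataset.

```python
from collections import Counter

# Illustrative stop list only; real lists (such as spaCy's STOP_WORDS) are much larger.
STOP_WORDS_DEMO = {"a", "an", "the", "is", "on", "in", "of", "and", "to", "it"}

# Hypothetical work-order style sentences, invented for this sketch.
docs = [
    "the pump is leaking oil on the floor",
    "the valve is stuck and the pump is noisy",
    "a seal in the pump is worn",
]

def bag_of_words(texts, stop_words=frozenset()):
    """Count lowercase whitespace tokens, skipping any found in stop_words."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in text.lower().split() if t not in stop_words)
    return counts

full = bag_of_words(docs)
filtered = bag_of_words(docs, STOP_WORDS_DEMO)

print(len(full), "distinct tokens before filtering")
print(len(filtered), "distinct tokens after filtering")
print(sum(full.values()), "total tokens before,", sum(filtered.values()), "after")
```

On a corpus this small the numbers are only indicative, but the same comparison can be run on a real column of text to see how much a given stop list actually removes.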

### Example
Consider the sentence:
- "The cat is on the mat."

After removing stop words ("the," "is," "on"), the sentence becomes:
- "cat mat"

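This simplified version keeps the main content words while dropping function words, which is usually enough for bag-of-words style models. The relation between the words ("on") is lost, however, so tasks that depend on word order or grammar may be better served by the unfiltered text.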
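
The same example can be reproduced in a few lines of plain Python. This is a minimal sketch with a hand-picked stop list, not spaCy's pipeline. The word "not" is included deliberately, because many default lists (spaCy's English list among them) contain it; the second sentence shows what that can cost.

```python
# Tiny illustrative stop list; "not" is included on purpose (see the note above).
STOP_WORDS_DEMO = {"the", "is", "on", "not"}

def remove_stop_words(sentence, stop_words):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    return [t for t in tokens if t and t not in stop_words]

positive = remove_stop_words("The cat is on the mat.", STOP_WORDS_DEMO)
negative = remove_stop_words("The cat is not on the mat.", STOP_WORDS_DEMO)

print(positive)
print(negative)
print("Same tokens after filtering:", positive == negative)
```

Both sentences reduce to the same tokens, so a model that only sees the filtered text cannot tell them apart. For tasks where negation matters, it may be worth removing such words from the stop list before filtering.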

In [6]:
import spacy
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS

In [7]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [8]:
data = pd.read_csv('../data/work_orders_sample.csv')
data.head()

|   | failure_mode | description |
|---|---|---|
| 0 | Internal leakage | Compressor CP-001 is experiencing internal lea... |
| 1 | Abnormal instrument reading | Compressor CP-101 is showing abnormal pressure... |
| 2 | Abnormal instrument reading | Compressor C-101 is giving an abnormal high pr... |
| 3 | Abnormal instrument reading | Compressor C-101-A is giving abnormal instrume... |
| 4 | Abnormal instrument reading | Compressor CP-101 is giving an abnormal instru... |


In [9]:
doc_example = data['description'].sample(1).iloc[0]
print(doc_example)

Compressor C-101 has performance issues due to a structural deficiency, likely caused by vibration, which has resulted in material damage to the impeller, requiring replacement to restore normal operation.


In [10]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(doc_example)

for token in doc:
    if token.is_stop:
        print(token.text)

has
due
to
a
by
which
has
in
to
the
to


In [11]:
def preprocess_text(text):
    """Return the lemmas of `text` with stop words and punctuation removed."""
    doc = nlp(text)
    # Filter and lemmatize in a single pass, so the text is parsed only once
    # and each lemma is predicted from the token's original sentence context.
    lemmas = [
        token.lemma_
        for token in doc
        if not token.is_stop and not token.is_punct
    ]
    return ' '.join(lemmas)

In [12]:
# apply the preprocessing function to the 'description' column
data['cleaned_description'] = data['description'].apply(preprocess_text)
data[['description', 'cleaned_description']].head(10)

|   | description | cleaned_description |
|---|---|---|
| 0 | Compressor CP-001 is experiencing internal lea... | Compressor CP-001 experience internal leakage ... |
| 1 | Compressor CP-101 is showing abnormal pressure... | Compressor CP-101 show abnormal pressure readi... |
| 2 | Compressor C-101 is giving an abnormal high pr... | Compressor C-101 give abnormal high pressure r... |
| 3 | Compressor C-101-A is giving abnormal instrume... | Compressor C-101 give abnormal instrument read... |
| 4 | Compressor CP-101 is giving an abnormal instru... | Compressor CP-101 give abnormal instrument rea... |
| 5 | Compressor C-101A is showing abnormal instrume... | Compressor C-101A show abnormal instrument rea... |
| 6 | Compressor CP-001 is indicating abnormal instr... | Compressor CP-001 indicate abnormal instrument... |
| 7 | Compressor C-101A is giving an abnormal instru... | Compressor C-101A give abnormal instrument rea... |
| 8 | Compressor C-101-A is showing abnormal instrum... | Compressor C-101 show abnormal instrument read... |
| 9 | Compressor C-101 is giving abnormal instrument... | Compressor C-101 give abnormal instrument read... |
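
In rows 3 and 8 of the output above, the equipment tag `C-101-A` has been reduced to `C-101`. The most likely explanation is that the tokenizer split off the trailing `-A`, after which the hyphen was dropped as punctuation and "A" as the stop word "a". If tags need to survive preprocessing, one option is to match them before any stop word filtering. The sketch below uses only the standard library. The tag pattern is an assumption inferred from the handful of tags visible above, the stop list is a small illustrative one, and the test sentence is made up.

```python
import re

# Assumed tag shape: 1-3 capital letters, a hyphen, 2-4 digits, and an optional
# single-letter suffix with or without a hyphen (C-101, CP-001, C-101A, C-101-A).
TAG_PATTERN = re.compile(r"\b[A-Z]{1,3}-\d{2,4}(?:-?[A-Z])?\b")
WORD_PATTERN = re.compile(r"[A-Za-z]+")

# Small illustrative stop list, not spaCy's STOP_WORDS.
STOP_WORDS_DEMO = {"a", "an", "the", "is", "to", "by", "in", "has", "which", "due"}

def plain_words(chunk, stop_words):
    """Return the alphabetic words in chunk that are not stop words."""
    return [w for w in WORD_PATTERN.findall(chunk) if w.lower() not in stop_words]

def tokenize_keep_tags(text, stop_words):
    """Tokenize text, keeping equipment tags whole and exempt from filtering."""
    tokens = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Filter the ordinary text before the tag, then keep the tag as-is.
        tokens.extend(plain_words(text[pos:match.start()], stop_words))
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(plain_words(text[pos:], stop_words))
    return tokens

# Hypothetical sentence in the style of the dataset.
print(tokenize_keep_tags("Compressor C-101-A is showing a leak.", STOP_WORDS_DEMO))
```

This sketch skips lemmatization and discards numbers outside tags, so it is a starting point rather than a replacement for `preprocess_text`. Within spaCy itself, tokenizer special cases or custom tokenizer rules would be the more direct route to the same goal.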
