# TEXT PREPROCESSING

## stemming
>Stemming is a text normalization process in natural language processing (NLP) and information retrieval. It involves reducing words to their base or root form, known as the `"stem."` The goal of stemming is to group together words derived from the same root, even if they have different inflections or suffixes.

For example:

- "running" is stemmed to "run"      
- "better" is stemmed to "better" (unchanged; a stemmer does not relate it to "good")      
- "cats" is stemmed to "cat"     

>Stemming is particularly useful when you want to analyze the meaning of words in a text without being concerned with variations due to tense, plurality, or other grammatical forms. There are different stemming algorithms available, and one commonly used algorithm is the `Porter stemming algorithm`.

>In NLTK, you can perform stemming with several stemmer classes, such as `PorterStemmer`.

>example    
input
```python
from nltk.stem import PorterStemmer

# Create a stemmer
porter_stemmer = PorterStemmer()

# Example words
words = ["running", "better", "cats"]

# Stem the words
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results
print(stemmed_words)
```

output

```python
['run', 'better', 'cat']
```

## classification problem
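### Is a product comment a positive review or a negative review?
> Review -----> [eating, eat, eaten] --> ideal stem word is `eat`   
> [go, gone, going] --> ideal stem word is `go`   
> This is the ideal grouping. In practice a stemmer can miss irregular forms: in the output below, `PorterStemmer` leaves "eaten" unchanged.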
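
For a review classifier, the practical effect of stemming is a smaller vocabulary: several surface forms collapse into one feature. The following is a minimal sketch, assuming NLTK is installed. The token list is hypothetical and reuses words that are stemmed later in this notebook.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

# Hypothetical review tokens (illustrative only, not real review data).
tokens = ["eating", "eats", "writing", "writes", "programming", "programs"]

stemmer = PorterStemmer()

# Map each stem to the surface forms that reduce to it.
groups = defaultdict(list)
for token in tokens:
    groups[stemmer.stem(token)].append(token)

for stem, forms in groups.items():
    print(f"{stem}: {forms}")

print(f"{len(tokens)} distinct words -> {len(groups)} distinct stems")
```

Each stem can then serve as a single feature, so "eating" and "eats" count toward the same column instead of two separate ones.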

# Use case

In [19]:
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "finally", "finalize", "programs", "history"]

### PorterStemmer

In [20]:
from nltk.stem import PorterStemmer

In [21]:
stemming = PorterStemmer()

In [26]:
for word in words:
    print(word+" ---> "+stemming.stem(word))

eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> write
writes ---> write
programming ---> program
finally ---> final
finalize ---> final
programs ---> program
history ---> histori


In [23]:
stemming.stem('congratulations')

'congratul'

In [24]:
stemming.stem("sitting")

'sit'

In [25]:
stemming.stem("hugging")

'hug'

### RegexpStemmer class 

In NLTK, the `RegexpStemmer` class is part of the `nltk.stem` module and performs stemming with regular expressions. Unlike fixed algorithms such as the Porter stemmer, `RegexpStemmer` lets you define your own stemming rules: every substring that matches the pattern is removed from the word.

```python
from nltk.stem import RegexpStemmer

# Define a regular expression pattern for stemming
pattern = r'ing$|s$|ed$'

# Create a RegexpStemmer with the defined pattern
regexp_stemmer = RegexpStemmer(pattern)

# Example words
words = ["running", "better", "cats"]

# Stem the words using the regular expression pattern
stemmed_words = [regexp_stemmer.stem(word) for word in words]

# Print the results
print(stemmed_words)  # ['runn', 'better', 'cat']
```

>In this example, the regular expression pattern `r'ing$|s$|ed$'` specifies three different rules (the `$` anchors each rule to the end of the word):

- Words ending with "ing" will have "ing" removed.   
- Words ending with "s" will have "s" removed.   
- Words ending with "ed" will have "ed" removed.    

>Keep in mind that using regular expressions for stemming allows for more flexibility and customization, but it also requires careful crafting of the patterns to avoid over-stemming or under-stemming. The choice of stemming method, whether using a pre-defined algorithm or regular expressions, depends on the specific requirements of your NLP task.
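
To make the over-stemming risk concrete, here is a rough pure-Python sketch of suffix stripping with a regular expression. The function `regex_stem` and its `min_length` parameter are hypothetical names used for illustration; this is a simplified stand-in for the idea, not NLTK's implementation.

```python
import re


def regex_stem(word, pattern, min_length=0):
    """Remove every match of `pattern` from `word`; leave short words alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


pattern = r"ing$|s$|ed$"

for word in ["running", "cats", "sing", "bus", "red"]:
    plain = regex_stem(word, pattern)
    guarded = regex_stem(word, pattern, min_length=4)
    print(f"{word}: no minimum -> {plain!r}, min_length=4 -> {guarded!r}")
```

Short words such as "bus" and "red" are damaged by the bare pattern and protected by the length guard, while "sing" (exactly four letters) is still reduced to "s". A length threshold helps, but it does not replace careful pattern design.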

# Programming

In [28]:
from nltk.stem import RegexpStemmer

In [42]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [35]:
reg_stemmer.stem('eating')

'eat'

In [36]:
reg_stemmer.stem('eats')

'eat'

In [37]:
reg_stemmer.stem('eate')

'eat'

In [38]:
reg_stemmer.stem('eatable')

'eat'

In [39]:
reg_stemmer.stem('ingeating')

'ingeat'

In [41]:
reg_stemmer.stem('ingeating')  # after re-creating reg_stemmer with 'ing|s$|e$|able$' (no $ after ing)

'eat'
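
The last two cells differ only in the pattern, which is easy to miss because the same variable name is reused. A small side-by-side sketch, assuming NLTK is installed (the names `anchored` and `unanchored` are introduced here for illustration):

```python
from nltk.stem import RegexpStemmer

# With `$`, "ing" is removed only at the end of the word.
anchored = RegexpStemmer("ing$|s$|e$|able$", min=4)

# Without `$`, "ing" is removed wherever it occurs.
unanchored = RegexpStemmer("ing|s$|e$|able$", min=4)

word = "ingeating"
print(anchored.stem(word))    # 'ingeat'
print(unanchored.stem(word))  # 'eat'
```

Anchoring suffix rules with `$` is usually the safer choice, since an unanchored rule also deletes matches at the start or in the middle of a word.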

### SnowballStemmer

In [44]:
from nltk.stem import SnowballStemmer


In [48]:
snowballstemmer = SnowballStemmer('english')

In [50]:
for word in words:
    print(word+" ----> "+ snowballstemmer.stem(word))

eating ----> eat
eats ----> eat
eaten ----> eaten
writing ----> write
writes ----> write
programming ----> program
finally ----> final
finalize ----> final
programs ----> program
history ----> histori


In [54]:
stemming.stem("fairly"), stemming.stem("supportingly")  # PorterStemmer, for comparison

('fairli', 'supportingli')

In [53]:
snowballstemmer.stem("fairly"), snowballstemmer.stem("supportingly")

('fair', 'support')

In [55]:
snowballstemmer.stem("goes")

'goe'
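
The comparison above can be collected into one small table. This is a sketch assuming NLTK is installed; the sample words are taken from the cells above, and the variable names are illustrative.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

samples = ["fairly", "supportingly", "goes", "history"]

# word -> (Porter stem, Snowball stem)
results = {w: (porter.stem(w), snowball.stem(w)) for w in samples}

print(f"{'word':<14}{'Porter':<14}{'Snowball'}")
for w, (p, s) in results.items():
    print(f"{w:<14}{p:<14}{s}")
```

Snowball handles the "-ly" and "-ingly" endings better than Porter here, yet both stemmers still return "histori" for "history", and Snowball returns "goe" for "goes".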

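## Some problems with all the stemming techniques

Stems are not always valid words (`histori`, `goe`, `congratul`), and irregular forms such as "eaten" are left unchanged rather than mapped to "eat".

# So?

# Let's lemmatize!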