# Stemming & Lemmatization - Text Normalization
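### Stemming and lemmatization differ in how they derive root forms and in the kind of root they produce.
* Stemming - reduces inflected words to a root (stem), even if the stem is not an actual word of the language
* Lemmatization - reduces inflected words to their dictionary form (lemma), which is always an actual word of the language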

### Both techniques are used in tagging systems, indexing, SEO, and web search
* e.g., searching for *fish* also returns results containing *fishes* and *fishing*
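
The search behaviour described above can be sketched by indexing documents under the stems of their words. This is only a minimal illustration: it assumes simple whitespace tokenization, the three-document corpus is made up, and `build_index` and `search` are hypothetical helper names, not part of nltk.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus, for illustration only
docs = {
    1: "He fishes every weekend",
    2: "Fishing is relaxing",
    3: "The market sells bread",
}

def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index

def search(index, query):
    """Look up a single-word query by its stem."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))

index = build_index(docs)
print(search(index, "fish"))  # [1, 2]
```

Because both the documents and the query are stemmed, *fish* matches the documents containing *fishes* and *fishing*, while the unrelated document is not returned.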

In [1]:
import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### With nltk, stemming words can be done with:
* PorterStemmer
* LancasterStemmer

In [2]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

### nltk.stem allows for stemming with different classes

In [3]:
porter = PorterStemmer()
lancaster = LancasterStemmer()

print("Porter Stemmer")
print(porter.stem("cats"))
print(porter.stem("trouble"))
print(porter.stem("troubling"))
print(porter.stem("troubled"))
print()
print("Lancaster Stemmer")
print(lancaster.stem("cats"))
print(lancaster.stem("trouble"))
print(lancaster.stem("troubling"))
print(lancaster.stem("troubled"))

Porter Stemmer
cat
troubl
troubl
troubl

Lancaster Stemmer
cat
troubl
troubl
troubl


* troubl - the Porter stemmer often produces stems that are not real words, because it strips suffixes by rule rather than by linguistic analysis
* It is known for its speed and simplicity
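
To see why rule-based suffix stripping can yield non-words such as *troubl*, here is a toy sketch. It is deliberately NOT the Porter algorithm: the rule list and the `toy_stem` helper are assumptions made up for illustration, and the real stemmer applies many more rules under additional conditions.

```python
# Toy suffix rules, checked in order; NOT the actual Porter rule set
SUFFIX_RULES = ["ing", "ed", "es", "s", "e"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["cats", "trouble", "troubling", "troubled"]:
    print(w, "->", toy_stem(w))
```

The rules never check a dictionary, so *trouble*, *troubling*, and *troubled* all collapse to the same string whether or not that string is a word.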

In [4]:
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize",
             "misunderstanding","railroad","moonlight","football"]

print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","Lancaster Stemmer"))

for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word,porter.stem(word),lancaster.stem(word)))

Word                Porter Stemmer      Lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             


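* dest - the Lancaster stemmer applies its rules iteratively and is more aggressive than the Porter stemmer
* Over-stemming can occur: *destabilize* is cut down to *dest*, and *friendship* is merged with *friend*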
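
One way to make the difference in aggressiveness visible is to group words by the stem each stemmer assigns. This is a small sketch; `group_by_stem` is a hypothetical helper written for this example, not an nltk function.

```python
from collections import defaultdict
from nltk.stem import LancasterStemmer, PorterStemmer

def group_by_stem(stemmer, words):
    """Return {stem: [words that reduce to it]}."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)

words = ["friend", "friendship", "friends", "friendships"]
porter_groups = group_by_stem(PorterStemmer(), words)
lancaster_groups = group_by_stem(LancasterStemmer(), words)

print("Porter:   ", porter_groups)
print("Lancaster:", lancaster_groups)
```

In line with the table above, Porter keeps *friend* and *friendship* in separate groups, while Lancaster places all four words under a single stem.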

### Passing a whole sentence to the stemmer does not work

In [5]:
sentence="Pythoners are very intelligent and work very pythonly and now they are pythoning their way to success."

porter.stem(sentence)

'pythoners are very intelligent and work very pythonly and now they are pythoning their way to success.'

* The stemmer treats the entire sentence as a single word: the input is only lowercased, and no individual word is stemmed

### Break up the sentence using a Tokenizer

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize

def stemSentence(sentence):
    # Split the sentence into word and punctuation tokens
    token_words = word_tokenize(sentence)
    stem_sentence = []
    for word in token_words:
        # Stem each token and follow it with a space
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

x = stemSentence(sentence)
print(x)

python are veri intellig and work veri pythonli and now they are python their way to success . 


### Stem a document with the following steps
* (1) input a document
* (2) read document, line by line
* (3) tokenize a line
* (4) stem words of the line
* (5) output stemmed words
* (6) repeat, line by line
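
The six steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: `stem_line` and `stem_document` are hypothetical helper names, the demo file and its contents are made up, and `wordpunct_tokenize` (a regex-based nltk tokenizer) is swapped in for `word_tokenize` so that the example runs without downloading tokenizer models.

```python
import os
import tempfile
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

porter = PorterStemmer()

def stem_line(line):
    """Steps 3-4: tokenize one line and stem each token."""
    return " ".join(porter.stem(token) for token in wordpunct_tokenize(line))

def stem_document(in_path, out_path):
    """Steps 1-2 and 5-6: read line by line, write one stemmed line per input line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(stem_line(line) + "\n")

# Demo on a throwaway file (contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.txt")
    out_path = os.path.join(tmp, "stemmed.txt")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("Cats troubling friends\n\nfriendships\n")
    stem_document(in_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

Writing one output line per input line keeps paragraph breaks intact, and iterating over the file object avoids loading the whole document into memory.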

In [7]:
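# Open the file in a context manager so it is closed automatically
with open("../../Downloads/sun.txt", encoding="utf-8") as file:
    text = file.read()

text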

"The Sun is the star at the center of the Solar System. It is almost perfectly spherical and consists of hot plasma interwoven with magnetic fields.[12][13] It has a diameter of about 1,392,684 km (865,374 mi),[5] around 109 times that of Earth, and its mass (1.989×1030 kilograms, approximately 330,000 times the mass of Earth) accounts for about 99.86% of the total mass of the Solar System.[14] Chemically, about three quarters of the Sun's mass consists of hydrogen, while the rest is mostly helium. The remaining 1.69% (equal to 5,600 times the mass of Earth) consists of heavier elements, including oxygen, carbon, neon and iron, among others.[15]\n\nThe Sun formed about 4.567 billion[a][16] years ago from the gravitational collapse of a region within a large molecular cloud. Most of the matter gathered in the center, while the rest flattened into an orbiting disk that would become the Solar System. The central mass became increasingly hot and dense, eventually initiating thermonuclear f

* read lines - using readlines( )

In [9]:
# Read the document into a list with one string per line
with open("../../Downloads/sun.txt", encoding="utf-8") as file:
    my_lines_list = file.readlines()

my_lines_list

["The Sun is the star at the center of the Solar System. It is almost perfectly spherical and consists of hot plasma interwoven with magnetic fields.[12][13] It has a diameter of about 1,392,684 km (865,374 mi),[5] around 109 times that of Earth, and its mass (1.989×1030 kilograms, approximately 330,000 times the mass of Earth) accounts for about 99.86% of the total mass of the Solar System.[14] Chemically, about three quarters of the Sun's mass consists of hydrogen, while the rest is mostly helium. The remaining 1.69% (equal to 5,600 times the mass of Earth) consists of heavier elements, including oxygen, carbon, neon and iron, among others.[15]\n",
 '\n',
 'The Sun formed about 4.567 billion[a][16] years ago from the gravitational collapse of a region within a large molecular cloud. Most of the matter gathered in the center, while the rest flattened into an orbiting disk that would become the Solar System. The central mass became increasingly hot and dense, eventually initiating ther

* tokenize and stem each line - using stemSentence( ), created before

In [10]:
# porter and stemSentence() are reused from the cells above

print(my_lines_list[0])
print("Stemmed sentence")

x = stemSentence(my_lines_list[0])

print(x)

The Sun is the star at the center of the Solar System. It is almost perfectly spherical and consists of hot plasma interwoven with magnetic fields.[12][13] It has a diameter of about 1,392,684 km (865,374 mi),[5] around 109 times that of Earth, and its mass (1.989×1030 kilograms, approximately 330,000 times the mass of Earth) accounts for about 99.86% of the total mass of the Solar System.[14] Chemically, about three quarters of the Sun's mass consists of hydrogen, while the rest is mostly helium. The remaining 1.69% (equal to 5,600 times the mass of Earth) consists of heavier elements, including oxygen, carbon, neon and iron, among others.[15]

Stemmed sentence
the sun is the star at the center of the solar system . It is almost perfectli spheric and consist of hot plasma interwoven with magnet field . [ 12 ] [ 13 ] It ha a diamet of about 1,392,684 km ( 865,374 mi ) , [ 5 ] around 109 time that of earth , and it mass ( 1.989×1030 kilogram , approxim 330,000 time the mass of earth ) a

* save stemmed sentences to a txt file - using write( ) 

In [11]:
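# The output file name sun_stemmed.txt is an assumption; a separate file keeps sun.txt unchanged
with open("../../Downloads/sun_stemmed.txt", mode="w", encoding="utf-8") as stem_file:
    for line in my_lines_list:
        # End each stemmed line with a newline so the lines do not run together
        stem_file.write(stemSentence(line) + "\n")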