## Preprocessing:
    - Tokenization
    - Lexical Analysis
    - Stop word Elimination
    - Stemming
    - Lemmatization
    - Vectorization
        - BOW (bag of words)
        - TF-IDF
        - Word Embedding
            - Word2Vec (e.g. the CBOW architecture)
            - GloVe
    - Feature Selection
        - Statistical Approaches
            - Term Entropy Table: select the n terms with the highest weight $w(t) = 1 - \mathrm{entropy}(t)$, where
              $\mathrm{entropy}(t) = -\sum_{d=1}^{|D|} p_d(t) \log_{|D|} p_d(t)$ with $p_d(t) = \dfrac{tf_{t,d}}{\sum_{d' \in D} tf_{t,d'}}$
            - tf-idf
        - Semantic Approaches
            - Named Entity Recognition approach using POS tagging
              (extract the nouns that carry the most semantic meaning, i.e. group nearby nouns that form a concept)
    - Feature Extraction
        - Term Clustering
        - Latent Semantic Indexing
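
The term-entropy weighting listed under Feature Selection can be sketched in plain Python. This is a minimal illustration, not a library routine: the function names `term_entropy_weights` and `select_top_terms` and the toy corpus are made up for the example, and it assumes each document is already a list of tokens and that the corpus holds at least two documents (the logarithm base is |D|).

```python
import math
from collections import Counter


def term_entropy_weights(docs):
    """Return {term: 1 - entropy(term)}, with entropy taken in base |D|."""
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents for a base-|D| logarithm")
    counts = [Counter(doc) for doc in docs]   # term frequencies per document
    vocab = set().union(*counts)
    weights = {}
    for term in vocab:
        total = sum(c[term] for c in counts)  # sum of tf over all documents
        entropy = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total           # share of the term's occurrences in this document
                entropy -= p * math.log(p, n_docs)
        weights[term] = 1.0 - entropy
    return weights


def select_top_terms(docs, n):
    """Pick the n terms with the highest weight (ties broken alphabetically)."""
    weights = term_entropy_weights(docs)
    return sorted(weights, key=lambda t: (-weights[t], t))[:n]


# Hypothetical toy corpus of pre-tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["cat", "fish", "fish", "fish"],
]
weights = term_entropy_weights(docs)
print(select_top_terms(docs, 2))
```

A term spread evenly over all documents ("cat") gets a weight near 0, while a term concentrated in one document ("sat") gets weight 1, so the selection favours terms that discriminate between documents.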

### Tokenization

In [5]:
#Word tokenizer
import nltk
from nltk.tokenize import word_tokenize
sentence = "hi there. hello girl"
tokens = word_tokenize(sentence)
print(tokens)
word_tokenize('won’t')

['hi', 'there', '.', 'hello', 'girl']


['won', '’', 't']

In [7]:
#WordPunctTokenizer splits text into alphanumeric runs and punctuation runs, so punctuation becomes separate tokens

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" I can't allow you to go home early")

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']

In [10]:
#Sentence tokenization
from nltk.tokenize import sent_tokenize
sent_tokens = sent_tokenize("hi there. hello girl. can't")
print(sent_tokens)

['hi there.', 'hello girl.', "can't"]


In [12]:
#Tokenization using regular expressions

#customizable tokenization, often chosen for faster execution
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
#match runs of alphanumeric characters plus straight single quotes, so contractions like "won't" are not split
print(tokenizer.tokenize("won't is a contraction."))
print(tokenizer.tokenize("can't is a contraction."))

["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']
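
A pattern built around the straight quote `'` does not cover the typographic apostrophe `’`, which is why a contraction typed with `’` gets split. One possible workaround, sketched here with the standard `re` module only, is to normalize the apostrophe before tokenizing. The helper name `tokenize_keep_contractions` is hypothetical, and the sketch assumes `’` is the only quote variant that needs mapping.

```python
import re

TOKEN_PATTERN = re.compile(r"[\w']+")


def tokenize_keep_contractions(text):
    """Map the typographic apostrophe to a straight quote, then match word runs."""
    normalized = text.replace("\u2019", "'")  # U+2019 is the right single quotation mark
    return TOKEN_PATTERN.findall(normalized)


curly = "won\u2019t is a contraction."
print(TOKEN_PATTERN.findall(curly))       # contraction is split at the curly apostrophe
print(tokenize_keep_contractions(curly))  # contraction is kept whole
```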


In [19]:
#Tokenization using regular expressions with gaps=True

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # tokenize on whitespace
tokenizer.tokenize("won't is a contraction.")
# gaps=True means the pattern matches the separators (gaps) between tokens, not the tokens themselves.

["won't", 'is', 'a', 'contraction.']

In [14]:
#Tokenization using regular expressions with gaps=False
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=False)
# With gaps=False the pattern is used to identify the tokens themselves,
# so a whitespace pattern returns the whitespace runs rather than the words.
tokenizer.tokenize("won't is a contraction.")

[' ', ' ', ' ']
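
The two `gaps` settings can be illustrated with the standard `re` module alone. This is a sketch under the assumption that `gaps=True` behaves like splitting on the pattern (dropping empty strings) and `gaps=False` like collecting every match; it uses the whitespace pattern `r"\s+"` and is not a copy of NLTK's implementation.

```python
import re

text = "won't is a contraction."
pattern = re.compile(r"\s+")  # one or more whitespace characters

# gaps=True style: the pattern describes the separators, so split on it
# and drop any empty strings left at the edges.
tokens_from_gaps = [tok for tok in pattern.split(text) if tok]

# gaps=False style: the pattern describes the tokens, so collect the matches.
tokens_from_matches = pattern.findall(text)

print(tokens_from_gaps)
print(tokens_from_matches)
```

With a separator pattern, only the split-style reading yields words; the match-style reading returns the separators themselves.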

In [21]:
# Training own sentence tokenizer

# Useful for text with unusual formatting that a pretrained tokenizer handles poorly.
# To tokenize such text well, we can train our own sentence tokenizer on it.

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw(r'E:\Learning\ML\Learning Practice\sampletext.txt')  # getting the raw text (raw string keeps the backslashes literal)
sent_tokenizer = PunktSentenceTokenizer(text)  # passing text to the constructor trains the tokenizer on it
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1)
print(sents_1[0])

['Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'Praesent sit amet elementum mauris.', 'Curabitur finibus, velit eget lacinia tincidunt, tortor nisi ornare justo, in auctor enim nisi eu nulla.', 'Sed porta nibh vitae ante lobortis tempus.', 'Nullam auctor orci vitae volutpat venenatis.', 'Sed tristique lacus nisi, vitae faucibus erat mollis id.', 'Vivamus ac felis malesuada, interdum erat quis, mattis lorem.', 'Sed laoreet ut quam sed egestas.', 'Suspendisse potenti.', 'Nunc lacinia eros id quam ultricies, semper hendrerit lectus suscipit.', 'Maecenas eget orci purus.', 'Praesent diam quam, finibus ac viverra laoreet, volutpat vitae lectus.', 'Ut maximus magna leo, eu tincidunt nisl mattis non.', 'Vestibulum vitae nisl a ipsum eleifend malesuada.', 'Praesent porta, lectus a vulputate sodales, lorem ante venenatis nibh, a ultricies nisi erat pulvinar enim.', 'Quisque id eros sit amet risus hendrerit imperdiet.', 'Donec auctor mattis enim ut aliquam.', 'Maecenas et diam sit 

### Stemming
    Stemming extracts the base form of a word by removing affixes, so that morphological variants of the same root map to one stem.
    E.g. “chocolates” and “chocolatey” should map to the same root as “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the same stem as “retrieve”.
    The stem itself need not be a valid word: the Porter stemmer below returns 'chocol' for “chocolates”.
    The input to the stemmer is tokenized words.
    Why stemming? To normalize text and make it easier to process.
    Search engines use stemming when indexing: rather than storing all forms of a word, they can store only the stems.
    This reduces the size of the index and typically improves recall, since a query matches every variant of a word.
    The trade-off is information loss through understemming (related words left with different stems) and overstemming (unrelated words reduced to the same stem).
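
Understemming and overstemming are easy to see with a deliberately naive suffix stripper. The sketch below is a toy, not the Porter algorithm: the function `toy_stem`, its suffix list and the minimum stem length are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s"), min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Variants that are conflated as intended.
print([toy_stem(w) for w in ["retrieved", "retrieves", "retrieving"]])

# Understemming: "retrieval" keeps its own form and misses the shared stem.
print(toy_stem("retrieval"))

# Overstemming: the unrelated words "news" and "new" collapse to the same stem.
print(toy_stem("news"), toy_stem("new"))
```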

In [22]:
# PorterStemmer, LancasterStemmer
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer
word_stemmer = PorterStemmer() #LancasterStemmer()
word_stemmer.stem('chocolates')	

'chocol'

In [23]:
# SnowballStemmer
#Per the list printed below, it supports 14 languages besides English, plus the original Porter algorithm ('porter').
#To use this stemming class, create an instance with the name of the language and then call the stem() method.
import nltk
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
French_stemmer = SnowballStemmer('french')
French_stemmer.stem('Bonjoura')

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


'bonjour'

In [24]:
#Regular expression Stemmer
#With this stemming class, we can construct our own stemmer.
#It takes a single regular expression and removes every substring that matches it;
#anchor the pattern with ^ or $ to strip only a prefix or only a suffix.
import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing')
Reg_stemmer.stem('ingeat')

'eat'
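
The effect of anchoring the pattern can be shown with `re.sub` directly. This sketch assumes the regular-expression stemmer amounts to deleting every match of the pattern from words of at least a minimum length; `regexp_stem` and its `min_length` parameter are hypothetical names for the illustration.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` from `word`; leave short words untouched."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


print(regexp_stem("ingeat", "ing"))    # the match at the start is removed
print(regexp_stem("singing", "ing"))   # unanchored: both occurrences are removed
print(regexp_stem("singing", "ing$"))  # anchored: only the trailing suffix is removed
```

An unanchored pattern can strip far more than intended, so a suffix stemmer is usually safer written as `'ing$'`.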

### Lemmatization
    Lemmatization is similar to stemming, but its output is a ‘lemma’: the dictionary (root) form of a word rather than a root stem.

    In simple terms, stemming looks only at the form of the word, whereas lemmatization also looks at its meaning, so the result is normally a valid word.

    - WordNet is used for lemmatization and word lookup. WordNetLemmatizer treats the input as a noun unless a part of speech is passed via `pos`, and returns the word unchanged when WordNet has no entry for it.
				
		

In [25]:
#Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')  

'book'
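
Because a lemma depends on the word's part of speech, the same surface form can have different lemmas. The sketch below imitates that lookup with a tiny hand-written table standing in for WordNet. `LEXICON` and `toy_lemmatize` are hypothetical, and the noun default is chosen to mirror the lemmatizer call above.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech); a stand-in for WordNet.
LEXICON = {
    ("books", "n"): "book",
    ("books", "v"): "book",
    ("better", "a"): "good",
    ("was", "v"): "be",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("leaves"))       # treated as a noun by default
print(toy_lemmatize("leaves", "v"))  # treated as a verb
print(toy_lemmatize("better"))       # no noun entry, so returned unchanged
print(toy_lemmatize("better", "a"))  # adjective entry found
```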

### Stop word elimination

In [26]:
#Stop word removal
from nltk.corpus import stopwords
stopwords.fileids()  # lists the languages that have a stop word list
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
# The membership test is case-sensitive and the stop list is lowercase, so 'I' is kept
[word for word in words if word not in english_stops]

['I', 'writer']
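
The capitalized 'I' survives the filter because the comparison is case-sensitive. A case-insensitive variant might look like the sketch below. It uses a tiny hard-coded stop list (`STOP_WORDS`, hypothetical) so that it runs without any corpus download; in practice the set would come from the stop word corpus loaded above.

```python
# Hypothetical, tiny stop list used only for this sketch.
STOP_WORDS = {"i", "am", "a", "the", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list; keep original casing."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


filtered = remove_stop_words(["I", "am", "a", "writer"])
print(filtered)
```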