# Text Mining

# NLP
Natural language processing (NLP) is a field located at the intersection of data science and Artificial Intelligence (AI) that – when boiled down to the basics – is all about teaching machines how to understand human languages and extract meaning from text.

## Applications:
1. Document Classification
2. Review Analysis - Sentiment Analysis
3. Search Engines
4. Machine Translation
5. Chatbots
6. Spell Correction
7. Summarization
8. Dialogue Systems (Machine Conversation)
9. Spam Detection
10. Named Entity Recognition

## 8 best Python Natural Language Processing (NLP) libraries:
1. **Natural Language Toolkit (NLTK):**\
https://www.nltk.org/ \
NLTK is an essential library that supports tasks such as classification, stemming, tagging, parsing, semantic reasoning, and tokenization in Python. (Book: https://www.nltk.org/book/)

2. **TextBlob:** \
https://textblob.readthedocs.io/en/dev/ \
TextBlob is a must for developers who are starting their journey with NLP in Python and want to make the most of their first encounter with NLTK.

3. **CoreNLP:** \
https://stanfordnlp.github.io/CoreNLP/ \
This library was developed at Stanford University and it’s written in Java. Still, it’s equipped with wrappers for many different languages, including Python.

4. **Gensim:** \
https://github.com/RaRe-Technologies/gensim \
Gensim is a Python library that specializes in identifying semantic similarity between documents through vector space modeling and topic modeling.

5. **spaCy:** \
https://spacy.io/ \
spaCy is a relatively young library that was designed for production use, which is why it is often considered more accessible than older Python NLP libraries such as NLTK.

6. **polyglot:** \
https://polyglot.readthedocs.io/en/latest/index.html \
This slightly lesser-known library is one of our favorites because it offers a broad range of analysis and impressive language coverage. Thanks to NumPy, it also works really fast.

7. **scikit-learn:** \
https://scikit-learn.org/ \
This general-purpose machine learning library provides developers with a wide range of algorithms for building models. For NLP, it offers feature extractors based on the bag-of-words method that can be used to tackle text classification problems.

8. **Pattern:** \
https://www.clips.uantwerpen.be/clips.bak/pages/pattern \
Pattern is another gem among the libraries Python developers use to handle natural languages. It supports part-of-speech tagging, sentiment analysis, vector space modeling, SVM, clustering, n-gram search, and WordNet access.

# Text Word-level Representation (Word Embedding)

[Watch YouTube Videos for details](https://www.youtube.com/channel/UC3d1uzFtJxqPsirAc48zPEA)

1. **One-hot Encoding:** \
A one-hot encoding represents categorical variables as binary vectors. Each category (for text, each word in the vocabulary) is first mapped to an integer index, and then represented as a binary vector that is all zeros except at that index, which is marked with a 1.
2. **Bag-Of-Words:** \
In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

3. **Word-Embedding:** \
A word embedding represents a word as a real-valued vector that encodes its meaning, such that words that are closer in the vector space are expected to be similar in meaning.
4. **TF-IDF:** \
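This weighting scheme is widely used in search technologies. TF-IDF stands for term frequency-inverse document frequency: the weight of a word grows with how often it appears in a document and shrinks with the number of documents that contain it.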
5. **Word2Vec:**\
The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence.
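
The first representations in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the toy corpus, the whitespace tokenization, and the helper names `one_hot`, `bag_of_words`, and `tf_idf` are assumptions made for this example, and the TF-IDF variant shown is the unsmoothed textbook form (libraries such as scikit-learn use smoothed variants by default).

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); tokens come from a plain split().
docs = [
    "i learn nlp",
    "i learn python",
    "python is user friendly",
]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct word, sorted so each word gets a stable index.
vocab = sorted({word for doc in tokenized for word in doc})
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Binary vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def bag_of_words(tokens):
    """Count vector: word order is discarded, multiplicity is kept."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tf_idf(tokens):
    """Term frequency times inverse document frequency (unsmoothed variant)."""
    counts = Counter(tokens)
    vec = []
    for word in vocab:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in tokenized if word in doc)  # documents containing the word
        idf = math.log(len(tokenized) / df)
        vec.append(tf * idf)
    return vec


print(vocab)
print(one_hot("nlp"))
print(bag_of_words(tokenized[0]))
print([round(x, 3) for x in tf_idf(tokenized[0])])
```

In the TF-IDF vector of the first document, `nlp` receives a higher weight than `i` or `learn`, because it appears in only one document of the toy corpus.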

## NLTK

[https://www.nltk.org/](https://www.nltk.org/)

*   NLTK is a leading platform for building Python programs to work with human language data.
*   It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.
*   It also includes text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

In [None]:
%pip install nltk


In [None]:
# !pip install nltk
import nltk

In [None]:
print(nltk.__version__)

In [None]:
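# Punkt is an unsupervised, trainable sentence tokenizer; NLTK ships pre-trained models for it
# Download the Punkt models
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: required by newer NLTK releases (3.9 and later)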

In [None]:
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
test_text = 'I learn NLP. I learn Python. Its user friendly. I am ready.'
sent_tokenize(test_text)

In [None]:
# Persian sample text: "Hello! My name is Reza. How are you?"
test2 = 'سلام! اسم من رضا هست. حالتون چطوره؟'
sent_tokenize(test2)
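
To see why a trained model such as Punkt is useful, it helps to compare it with a naive rule that splits after every sentence-ending punctuation mark. The sketch below is not how NLTK works internally; `naive_sent_split` is a hypothetical helper written only for this comparison, and the second sample sentence is made up.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!', '?' or the Arabic question mark when whitespace follows."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [part for part in parts if part]


print(naive_sent_split("I learn NLP. I learn Python. I am ready."))
# ['I learn NLP.', 'I learn Python.', 'I am ready.']

print(naive_sent_split("Hello! Mr. Reza is here. How are you today?"))
# ['Hello!', 'Mr.', 'Reza is here.', 'How are you today?']
```

The naive rule breaks the second text after the abbreviation `Mr.`, which is the kind of case a Punkt model is trained to recognize and typically handles without splitting.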

In [None]:
!gdown --id 1oVyJvIIXM7eHBEMC_N-fxH9aaAUGiTL5

In [None]:
#open a text file
test_file = open("smaple_text.txt", mode='r')

### mode
'r'	: Open a text file for reading \
'w'	: Open a text file for writing (an existing file is overwritten) \
'a'	: Open a text file for appending
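
The three modes can be tried out safely on a temporary file. This is a small sketch: the file name `example.txt` is hypothetical, and the `with` statement is used so that each file is closed automatically once its block ends.

```python
import os
import tempfile

# Work in a temporary directory so that no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # 'w' creates the file, or truncates it if it already exists.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write("I learn NLP.\n")

    # 'a' keeps the existing content and writes at the end.
    with open(path, mode="a", encoding="utf-8") as f:
        f.write("I learn Python.\n")

    # 'r' opens the file for reading.
    with open(path, mode="r", encoding="utf-8") as f:
        content = f.read()

print(content)
```

After the three steps, `content` holds both lines, because the append step added to the text written in the first step instead of replacing it.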

In [None]:
text_read = test_file.read()
print(text_read)

In [None]:
len(text_read)  # the number of characters

In [None]:
import nltk.data
Punkt_tok = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
Punkt_tok.tokenize(text_read)

In [None]:
len(Punkt_tok.tokenize(text_read))

### We can train our tokenizer based on our text

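NLTK's `webtext` corpus is a small collection of web text that includes a Firefox discussion forum, conversations overheard in New York, a movie script, personal advertisements, and wine reviews. It should not be confused with [WebText](https://paperswithcode.com/dataset/webtext), an internal OpenAI corpus created by scraping web pages with an emphasis on document quality.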

In [None]:
import nltk
nltk.download('webtext')

In [None]:
from nltk.corpus import webtext
text_parameter = webtext.raw('overheard.txt')
print(text_parameter)  # transcripts of conversations overheard in New York

In [None]:
# Train our own tokenizer: passing raw text to the constructor trains a Punkt model on it
from nltk.tokenize import PunktSentenceTokenizer
My_tokenizer = PunktSentenceTokenizer(text_parameter)

In [None]:
type(My_tokenizer)

In [None]:
from nltk.tokenize import sent_tokenize    # to compare two methods
pre_token = sent_tokenize(text_parameter)
our_token = My_tokenizer.tokenize(text_parameter)

In [None]:
pre_token[0]

In [None]:
our_token[0]

## Word Tokenization

In [None]:
from nltk.tokenize import word_tokenize
word_tokenize(test_text)

In [None]:
word_tokenize("don't")

### TreebankWordTokenizer

In [None]:
from nltk.tokenize import TreebankWordTokenizer
tree_tokenizer = TreebankWordTokenizer()  # create a tokenizer object
# Same problem as word_tokenize: the contraction "can't" is split into "ca" and "n't"
tree_tokenizer.tokenize("Hello! Mr reza. How are you today? I can't stand")

### WordPunctTokenizer

In [None]:
from nltk.tokenize import WordPunctTokenizer
Punkt_token = WordPunctTokenizer()
Punkt_token.tokenize("can't")
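
The behavior of `WordPunctTokenizer` can be approximated with a single regular expression. The sketch below assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which to our knowledge is the pattern the NLTK class is built on; `word_punct_split` is a hypothetical helper and not part of NLTK.

```python
import re

# A run of word characters, or a run of characters that are neither
# word characters nor whitespace (that is, punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_split(text):
    """Return alphabetic/numeric runs and punctuation runs as separate tokens."""
    return WORD_PUNCT.findall(text)


print(word_punct_split("can't"))
# ['can', "'", 't']

print(word_punct_split("Hello! Mr reza. I can't stand"))
# ['Hello', '!', 'Mr', 'reza', '.', 'I', 'can', "'", 't', 'stand']
```

Because the apostrophe is punctuation, a contraction is always cut into three tokens, in contrast to the Treebank-style split into `ca` and `n't`.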