# Natural Language Processing

Natural language processing (NLP) mines unstructured text data to make sense of it using statistical and machine learning approaches. Typical tasks include:
1. Sentiment analysis
2. Topic modeling
3. Text classification
4. Sentence segmentation or POS tagging

Tools:
    NLTK (Natural Language Toolkit): a suite of open-source tools that makes NLP workflows easier to build in Python. It was originally created in 2001 at the University of Pennsylvania.

Install NLTK (see http://www.nltk.org/install.html): run `pip install --user -U nltk`

Install NumPy (optional): run `pip install --user -U numpy`

Test the installation: run `python`, then type `import nltk`

In [3]:
import nltk
nltk.download()  # opens the NLTK downloader; selected packages and corpora are saved locally on your computer

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [4]:
#Check what the package contains (attributes, methods, functions, etc.)
dir(nltk)

['AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'Counter',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 'DependencyGrap

## Things that can be done with NLTK

## Stopwords
Stopwords are words that occur very frequently but contribute little to the meaning of a document. In tasks such as sentiment analysis they are mostly noise and are usually removed.

In [10]:
#Check stopwords
from nltk.corpus import stopwords
stopwords.words('english')[0:500:25]  # every 25th word from index 0 up to 500 (the list is shorter, so slicing stops early)


['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']
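
As a minimal sketch of how such a list is typically used, the snippet below filters stopwords out of a tokenized sentence. The helper `remove_stopwords` and the small `sample_stopwords` set are hypothetical and exist only for illustration. With NLTK installed, `set(stopwords.words('english'))` could be passed instead.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Hypothetical, hand-picked subset for illustration; NLTK's English list is much longer.
sample_stopwords = {"i", "have", "been", "with", "here", "very", "the", "a"}

# Naive whitespace tokenization keeps the example dependency-free.
tokens = "I have been very happy with the service here".split()
filtered = remove_stopwords(tokens, sample_stopwords)
print(filtered)
```

Converting the stopword list to a `set` keeps each membership test fast, which matters once the filter runs over thousands of messages.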

## Reading unstructured text data and cleaning

### Reading the file the hard way

In [3]:
data=open("data/SMSSpamCollection.tsv").read()
data[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

In [5]:
#The data contains tabs and newline characters: replace every \t with \n, then split on \n to get a list
data_list=data.replace("\t", "\n").split("\n")
data_list[0:5] #View the first 5 entries

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham']

The data now has a regular structure: labels (ham or spam) alternate with message texts, so the two can be separated by slicing.

In [10]:
labeledData= data_list[0::2]
labeledText= data_list[1::2]
print(labeledData[0:5], labeledText[0:5])

['ham', 'spam', 'ham', 'ham', 'ham'] ["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']


In [11]:
#Convert to more structured form of a dataframe
import pandas as pd
corpus = pd.DataFrame(
         {"Label": labeledData,
         "Text": labeledText}
        )

ValueError: arrays must all be same length

The error is intentionally left in the notebook for the sake of understanding. Working through it helps channel the thought process needed for NLP data cleaning.

In [12]:
#Check length
print(len(labeledData),len(labeledText))

5571 5570


In [13]:
#Check the last element of the label list
labeledData[-1]

''

In [14]:
#Check the last element of the text list
labeledText[-1]

'Rofl. Its true to its name'

In [15]:
#Get rid of the last element while creating the dataframe
corpus = pd.DataFrame(
         {"Label": labeledData[:-1],
         "Text": labeledText}
        )
corpus

Unnamed: 0,Label,Text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
...,...,...
5565,spam,This is the 2nd time we have tried 2 contact u...
5566,ham,Will ü b going to esplanade fr home?
5567,ham,"Pity, * was in mood for that. So...any other s..."
5568,ham,The guy did some bitching but I acted like i'd...
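
The extra empty label most likely comes from a trailing newline at the end of the file: splitting on `\n` leaves a final empty string, which lands in the label list. A sketch of an alternative that avoids the mismatch is to split each line on its first tab. It assumes every non-empty line has the form `label<TAB>text`. The `sample` string is a shortened stand-in for the file contents, and `parse_labeled_lines` is a hypothetical helper.

```python
def parse_labeled_lines(raw):
    """Split raw 'label<TAB>text' lines into parallel label and text lists."""
    labels, texts = [], []
    for line in raw.split("\n"):
        if not line.strip():
            continue  # skip empty lines, such as the one left by a trailing newline
        label, text = line.split("\t", 1)  # split on the first tab only
        labels.append(label)
        texts.append(text)
    return labels, texts

# Shortened stand-in for the file contents, ending with a newline like the real file seems to.
sample = "ham\tNah I don't think he goes to usf\nspam\tFree entry in 2 a wkly comp\n"
labels, texts = parse_labeled_lines(sample)
print(len(labels), len(texts))
```

Because labels and texts are appended in pairs, the two lists always have the same length. A line without a tab would raise a `ValueError` instead of silently shifting the alignment.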


### Reading the file the easy way

In [16]:
dataset = pd.read_csv("data/SMSSpamCollection.tsv", sep='\t', header=None)
dataset.head()

Unnamed: 0,0,1
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


## Data Exploration

In [27]:
#Assign column names
dataset.columns=["label","text"]
dataset.head()

Unnamed: 0,label,text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [30]:
#Check shape, missing values, count of ham/spam
dataset.shape, dataset.isna().sum(), dataset["label"].value_counts()

((5568, 2),
 label    0
 text     0
 dtype: int64,
 ham     4822
 spam     746
 Name: label, dtype: int64)
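
Note that `read_csv` reports 5568 rows, while the manual split earlier produced 5570 messages. One plausible explanation, not verified against this file, is quote handling: by default both Python's `csv` module and `pandas.read_csv` treat `"` as a quoting character, so an unbalanced quote can merge several lines into one field. The sketch below shows the effect with the standard-library `csv` module on a made-up `sample` string.

```python
import csv
import io

# Made-up sample: the first message opens a double quote that only closes on the next line.
sample = 'ham\t"quoted start\nspam\tends with quote"\nham\tplain text\n'

# Default quoting: the quoted field swallows the newline and the following record.
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE: quote characters are treated as ordinary text, one record per line.
raw_rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

print(len(default_rows), len(raw_rows))
```

`pandas.read_csv` accepts the same constant through its `quoting` parameter (`quoting=csv.QUOTE_NONE`), which would be one way to test this hypothesis on the real file.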

## Regular Expression (Regex)
Common uses in text cleaning:
1. Identify whitespace between words or characters
2. Identify or create delimiters and end-of-line escape characters
3. Remove punctuation, numbers, or HTML tags from text
4. Identify patterns in text
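
The uses listed above can be sketched with Python's built-in `re` module. The `text` string is a made-up example loosely modeled on the spam messages in the dataset, and the patterns are illustrative choices rather than the only reasonable ones.

```python
import re

# Made-up message with extra spaces, a tab, HTML tags, punctuation and digits.
text = "Free   entry!! Call <b>08452810075</b> now,\twin 2 tkts."

# Use 3 (HTML tags): drop anything that looks like <...>.
no_tags = re.sub(r"<[^>]+>", "", text)

# Use 1 (whitespace): collapse runs of spaces and tabs into a single space.
collapsed = re.sub(r"\s+", " ", no_tags)

# Use 2 (delimiters): split on runs of non-word characters, dropping empty pieces.
tokens = [t for t in re.split(r"\W+", collapsed) if t]

# Use 3 (punctuation and numbers): keep only ASCII letters and spaces.
letters_only = re.sub(r"[^A-Za-z ]", "", collapsed)

# Use 4 (patterns): find runs of five or more digits, such as phone-like numbers.
numbers = re.findall(r"\d{5,}", text)

print(collapsed)
print(tokens)
print(numbers)
```

The tags are removed first so that tag names such as `b` do not show up as tokens later.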