In [49]:
import nltk

## STEMMING
<p>Stemming reduces a word to its root (stem) by stripping suffixes, so that inflected
variants of the same word map to a single token.</p>

In [2]:
fishwords = ['fish','Fishing','Fishes']
prt = nltk.PorterStemmer()
[prt.stem(ts) for ts in fishwords]

['fish', 'fish', 'fish']
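
<p>A stem is not always a dictionary word, and unrelated words can collapse to the same stem. The sketch below illustrates this with a few words chosen for this example (they are not part of the original notebook); it uses only <code>nltk.PorterStemmer</code>, which needs no downloaded corpus data.</p>

```python
import nltk

# PorterStemmer is rule-based, so no NLTK data download is required.
stemmer = nltk.PorterStemmer()

# Illustrative words picked for this sketch (an assumption, not from the notebook).
words = ['running', 'studies', 'university', 'universe']
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, '->', stem)
```

<p>Here 'studies' becomes 'studi', which is not an English word, and 'university' and 'universe' end up with the same stem even though their meanings differ.</p>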

<h2>Lemmatization</h2>
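<p>Lemmatization converts a word to its dictionary form (lemma). In NLTK,
WordNetLemmatizer() does this. It is case-sensitive and treats words as nouns by default,
which is why the capitalized inputs below come back unchanged.</p>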

In [4]:
fishwords = ['fishes','Fishings','Fishes']
WNLemma = nltk.WordNetLemmatizer()
[WNLemma.lemmatize(ts) for ts in fishwords]

['fish', 'Fishings', 'Fishes']

<h2>Tokenization</h2>
<p>Python's built-in split() can tokenize text on whitespace, but it leaves punctuation
attached to the words. NLTK's tokenizers split punctuation into separate tokens.</p>

In [11]:
text2 = "Why are you so intelligent? Bregz@ 1309"
words = text2.split(' ')
print(words)
print(nltk.word_tokenize(text2))

['Why', 'are', 'you', 'so', 'intelligent?', 'Bregz@', '1309']
['Why', 'are', 'you', 'so', 'intelligent', '?', 'Bregz', '@', '1309']
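
<p>To see why the second output differs from the first, here is a rough standard-library approximation of punctuation-aware tokenization. It is only a sketch: the regular expression is an assumption made for illustration and is not the algorithm <code>nltk.word_tokenize</code> actually uses, although it happens to give the same tokens for this sentence.</p>

```python
import re

text2 = "Why are you so intelligent? Bregz@ 1309"

# \w+      matches a run of letters, digits, or underscores (a "word")
# [^\w\s]  matches a single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", text2)
print(tokens)
```

<p>Unlike split(' '), this separates '?' and '@' from the words they were attached to.</p>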


In [12]:
text3 = "His name John. He lives in U.K. with his wife Is he the best? No he is not!"
sent1 = text3.split(".")
print(sent1)
sent2 = nltk.sent_tokenize(text3)
print(sent2)

['His name John', ' He lives in U', 'K', ' with his wife Is he the best? No he is not!']
['His name John.', 'He lives in U.K. with his wife Is he the best?', 'No he is not!']
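
<p>A tempting middle ground is to split on whitespace that follows '.', '!' or '?'. The sketch below (a naive regex written for this illustration, not what NLTK does internally) shows that this still breaks on the abbreviation "U.K.", which <code>nltk.sent_tokenize</code> handled correctly above.</p>

```python
import re

text3 = "His name John. He lives in U.K. with his wife Is he the best? No he is not!"

# Split on whitespace that is preceded by sentence-ending punctuation.
naive_sentences = re.split(r"(?<=[.!?])\s+", text3)
for sentence in naive_sentences:
    print(repr(sentence))
```

<p>The naive splitter produces four pieces because it cuts after "U.K.", whereas the NLTK output above has three sentences.</p>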


<p>Another task covered under Natural Language Processing is Part-of-Speech (POS)
tagging, which labels each word in a sentence with its grammatical category: noun,
pronoun, verb, adjective, and so on.</p>

In [13]:
#using text2
tokenized_words = nltk.word_tokenize(text2)
nltk.pos_tag(tokenized_words)

[('Why', 'WRB'),
 ('are', 'VBP'),
 ('you', 'PRP'),
 ('so', 'IN'),
 ('intelligent', 'JJ'),
 ('?', '.'),
 ('Bregz', 'NNP'),
 ('@', 'NN'),
 ('1309', 'CD')]
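
<p>The tags follow the Penn Treebank convention, where related categories share a prefix (for example NN, NNS, NNP for nouns). A minimal sketch of filtering the tagged output by prefix is shown below; the list is copied from the output above so the snippet runs without NLTK.</p>

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('Why', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('so', 'IN'),
          ('intelligent', 'JJ'), ('?', '.'), ('Bregz', 'NNP'), ('@', 'NN'),
          ('1309', 'CD')]

# Keep words whose tag starts with a given prefix.
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print('adjectives:', adjectives)
print('nouns:', nouns)
```

<p>Note that '@' shows up among the nouns: the tagger assigns some tag to every token, so symbols and unknown words can be labeled incorrectly.</p>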

## TEXT CLASSIFICATION

In [27]:
from sklearn.naive_bayes import MultinomialNB

In [28]:
#import newsgroup data
from sklearn.datasets import fetch_20newsgroups

In [29]:
train_data = fetch_20newsgroups(subset = 'train',shuffle = True)


In [30]:
train_data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [31]:
print(train_data.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [32]:
#convert above sentences to word vectors using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
count_vector.fit(train_data.data)

In [33]:
# count_vector was already created and fitted in the previous cell
train_count = count_vector.transform(train_data.data)
print(train_count.shape)

(11314, 130107)
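
<p>The shape above means 11314 documents by 130107 distinct vocabulary terms. The sketch below builds the same kind of count matrix by hand for two made-up documents (hypothetical examples, not from the dataset). It mimics CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters, but it returns plain lists instead of a sparse matrix.</p>

```python
import re
from collections import Counter

# Two toy documents invented for this illustration.
docs = ["The car is red", "The car is a fast car"]

def tokenize(doc):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary term.
counts = []
for doc in docs:
    term_freq = Counter(tokenize(doc))
    counts.append([term_freq[term] for term in vocabulary])

print(vocabulary)
print(counts)
print((len(counts), len(vocabulary)))
```

<p>The single-character token "a" is dropped, so the matrix has shape (2, 5).</p>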


<p>The matrix above stores only raw word counts. To weight words by how informative they
are, we can apply TF-IDF, which down-weights words that appear in many documents. First
we import TfidfTransformer.</p>

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer

<p>Create the transformer, fit it on the training counts, and transform them</p>

In [35]:
tfidf_model = TfidfTransformer()
tfidf_model.fit(train_count)
train_tfidf = tfidf_model.transform(train_count)
train_tfidf.shape

(11314, 130107)
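
<p>The shape is unchanged because TF-IDF only reweights the existing counts. As a rough sketch of what happens to each entry, the code below applies the weighting by hand to a small hypothetical count matrix, assuming TfidfTransformer's default settings (smoothed IDF, L2 row normalization, no sublinear term frequency). The vocabulary and counts are invented for illustration.</p>

```python
import math

# Hypothetical vocabulary and count matrix (2 documents x 5 terms).
vocabulary = ['car', 'fast', 'is', 'red', 'the']
counts = [[1, 0, 1, 1, 1],
          [2, 1, 1, 0, 1]]

n_docs = len(counts)

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Multiply counts by idf, then scale each row to unit Euclidean length.
tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

<p>In the first document, "red" (which occurs in only one document) ends up with a higher weight than "car" (which occurs in both), even though both have a raw count of 1.</p>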

<p>Create the model object</p>

In [36]:
NBModel = MultinomialNB()

<p>Fit the model on the training data</p>

In [37]:
NBModel.fit(train_tfidf,train_data.target)

MultinomialNB()

<p>To check the model's performance, we can use the test split available in the
same dataset</p>

In [38]:
test_data =  fetch_20newsgroups(subset='test', shuffle=True)
type(test_data.target)

numpy.ndarray

<p>Convert the test data to the same format, reusing the vectorizer and TF-IDF model fitted on the training data</p>

In [39]:
test_count = count_vector.transform(test_data.data)
test_tfidf = tfidf_model.transform(test_count)
test_tfidf.shape

(7532, 130107)

<p>Now that the test data is in the correct form, we can predict the targets using the
predict() function</p>

In [40]:
predicted = NBModel.predict(test_tfidf)
predicted

array([ 7, 11,  0, ...,  9,  3, 15])
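
<p>The predictions are integer indices into <code>target_names</code>. The sketch below maps the six values visible in the truncated output above back to newsgroup names; the list is copied from the <code>train_data.target_names</code> output earlier so that the snippet runs on its own.</p>

```python
# Category names copied from the train_data.target_names output.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Only the six predictions shown in the output; the rest are elided there.
visible_predictions = [7, 11, 0, 9, 3, 15]

predicted_names = [target_names[i] for i in visible_predictions]
for index, name in zip(visible_predictions, predicted_names):
    print(index, '->', name)
```

<p>In the notebook itself the same lookup would be <code>train_data.target_names[i]</code> for each value in <code>predicted</code>.</p>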

<p>We can evaluate the model using the accuracy score, which we import from
scikit-learn.</p>

In [41]:
from sklearn.metrics import accuracy_score

In [42]:
print('Accuracy :{:.2f}%'.format(accuracy_score(test_data.target,predicted)*100))

Accuracy :77.39%
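
<p>The count, TF-IDF, and Naive Bayes steps used so far can also be chained into a single scikit-learn <code>Pipeline</code>, so that the training-time vocabulary and weights are reused automatically at prediction time. The sketch below is one possible arrangement, shown on four invented documents with made-up labels (0 for cars, 1 for space) so that it runs without downloading the newsgroup data; the step names are arbitrary.</p>

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy training data invented for this sketch: label 0 = cars, label 1 = space.
toy_docs = [
    "the engine and the brakes of the car",
    "car wheels and engine oil",
    "the rocket reached orbit",
    "orbit of the space shuttle rocket",
]
toy_labels = [0, 0, 1, 1]

# Each step's output feeds the next; the last step is the classifier.
text_clf = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
text_clf.fit(toy_docs, toy_labels)

# predict() applies transform() on the first two steps, never fit().
predictions = text_clf.predict(["new engine for my car", "rocket launch into orbit"])
print(predictions)
```

<p>With the real data, the equivalent calls would be <code>text_clf.fit(train_data.data, train_data.target)</code> followed by <code>text_clf.predict(test_data.data)</code>.</p>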


In [46]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [47]:
def dataCleaner(raw_data):
    """Tokenize each document, drop stop words and non-alphanumeric tokens, stem the rest."""
    prt = nltk.PorterStemmer()  # create the stemmer once, not once per document
    cleaned_data = []
    for row in raw_data:
        # The stop-word check is case-sensitive, so capitalized stop words are kept.
        stemmed_words = [
            prt.stem(token)
            for token in nltk.word_tokenize(row)
            if token not in stop_words and token.isalnum()
        ]
        cleaned_data.append(' '.join(stemmed_words))
    return cleaned_data
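
<p>Two details of this filter are worth seeing in isolation. NLTK's English stop-word list is lowercase, so comparing raw tokens against it keeps capitalized stop words, and <code>isalnum()</code> rejects any token containing a hyphen or other punctuation. The sketch below uses a tiny stand-in stop-word set and an invented token list (both assumptions for illustration) so that it runs without NLTK data.</p>

```python
# Small stand-in for the NLTK stop-word set (hypothetical subset).
stop_words = {"the", "is", "a", "of"}

# Invented tokens for illustration.
tokens = ["The", "car", "is", "the", "fastest", "of", "all", "2-door", "e-mail"]

# Case-sensitive check, as in dataCleaner: "The" is not recognized as a stop word.
case_sensitive = [t for t in tokens if t not in stop_words and t.isalnum()]

# Lowercasing before the lookup removes it as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words and t.isalnum()]

print(case_sensitive)
print(case_insensitive)
```

<p>Both variants drop "2-door" and "e-mail" because of the hyphen. Whether to lowercase before the stop-word lookup is a design choice; the accuracy reported below was obtained with the case-sensitive version.</p>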

In [50]:
train_data_mod = dataCleaner(train_data.data)

In [53]:
# Print only the first cleaned document; printing the whole list exceeds Jupyter's IOPub data rate limit
print(train_data_mod[0])



In [54]:
test_data_mod = dataCleaner(test_data.data)

<p>Transform data to prepare for model training</p>

In [55]:
count_vector.fit(train_data_mod)
trainmod_count = count_vector.transform(train_data_mod)
testmod_count = count_vector.transform(test_data_mod)
tfidf_model.fit(trainmod_count)
trainmod_tfidf = tfidf_model.transform(trainmod_count)
testmod_tfidf = tfidf_model.transform(testmod_count)
testmod_tfidf.shape

(7532, 77399)

<p>Build and fit model</p>

In [65]:
NBModel_mod  = MultinomialNB()
NBModel_mod.fit(trainmod_tfidf,train_data.target)

MultinomialNB()

<p>Make Predictions and Evaluate Model Performance</p>

In [66]:
pred = NBModel_mod.predict(testmod_tfidf)
print('Accuracy:{:.2f}'.format(accuracy_score(test_data.target,pred)*100))

Accuracy:80.28
