# NLP Text Mining Using the NLTK Library

In [1]:
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
%matplotlib inline

**Corpus:** a collection of text, e.g. articles, blog posts, or book content.

We will do the text mining in three stages:

**1. Corpus Cleaning**

**2. DTM (Document Term Matrix) with TF-IDF**

**3. Model Building**

## 1. Corpus Cleaning

Steps:

1. Tokenize
2. Convert multilingual text
3. Convert the corpus to one case (lower or upper)
4. Remove punctuation
5. Remove white space
6. Remove stop words
7. Lemmatize or stem
8. De-tokenize

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import string
stop_words = stopwords.words('english')

In [3]:
def clean_corpus(corpus, tocase='lower', remove_punc=True, punctuations=list(string.punctuation), remove_whitespace=True,
                 stopwords=stop_words, lemmatize=False):
    """
    Takes the corpus as input and performs the corpus cleaning as required,
    then returns the detokenized corpus.
    """
    # Tokenize (word_tokenize also discards the white space between tokens,
    # so remove_whitespace needs no separate step here)
    tokens = word_tokenize(corpus)

    # Convert multilingual text
    # will be done later

    # Work in lower case so stop-word matching and lemmatization behave consistently
    valid_tokens = [token.lower() for token in tokens]

    # Remove punctuation
    if remove_punc:
        valid_tokens = [token for token in valid_tokens if token not in punctuations]

    # Remove stop words (use the list passed in, not the global one)
    stop_set = set(stopwords)
    valid_tokens = [token for token in valid_tokens if token not in stop_set]

    # Lemmatization / stemming
    if lemmatize:
        word_lem = WordNetLemmatizer()
        valid_tokens = [word_lem.lemmatize(token) for token in valid_tokens]
    else:
        pst = PorterStemmer()
        valid_tokens = [pst.stem(token) for token in valid_tokens]

    # Convert the corpus to one case (lower or upper)
    if tocase == 'upper':
        valid_tokens = [token.upper() for token in valid_tokens]

    # De-tokenize: join with spaces, but attach clitics such as "'s" and any
    # remaining punctuation to the preceding token
    cleaned_corpus = "".join([" "+i if not i.startswith("'") and i not in '!%\'()*+,-./:;<=>?@[\\]^_`{|}~'
                              else i for i in valid_tokens]).strip()
    return cleaned_corpus
    

In [4]:
text = """
       The senctence here ~ is Tokenized, Cleaning is #3 performed - & then ^ it is detokeized and sent back.
       
       This is the Second paragraph of the corpus.
       Let's Check if this is Efficient enough.
       """
clean_corpus(text, lemmatize=True)

"senctence tokenized cleaning 3 performed detokeized sent back second paragraph corpus let's check efficient enough"

In [5]:
bbc_tags = []
bbc_data = []
files_path = "00-Data/rawData/bbc-fulltext/bbc/"
for folder in os.listdir(files_path):
    folder_path = os.path.join(files_path, folder)
    # Each category is a sub-folder; skip README.TXT and any other plain file
    if os.path.isdir(folder_path):
        for file in os.listdir(folder_path):
            # Use a context manager so every file handle is closed after reading
            with open(os.path.join(folder_path, file)) as ofile:
                file_content = ofile.read()
            bbc_tags.append(folder)
            bbc_data.append(file_content)

In [6]:
bbc_df = pd.DataFrame({'Category': bbc_tags, 'text': bbc_data})

In [7]:
bbc_df['cleaned_text_stem'] = bbc_df['text'].apply(clean_corpus)

In [8]:
bbc_df['cleaned_text_lemm'] = bbc_df['text'].apply(lambda text: clean_corpus(text, lemmatize=True))

In [9]:
bbc_df.head()

| | Category | text | cleaned_text_stem | cleaned_text_lemm |
|---|---|---|---|---|
| 0 | business | Ad sales boost Time Warner profit\n\nQuarterly... | ad sale boost time warner profit quarterli pro... | ad sale boost time warner profit quarterly pro... |
| 1 | business | Dollar gains on Greenspan speech\n\nThe dollar... | dollar gain greenspan speech dollar hit highes... | dollar gain greenspan speech dollar hit highes... |
| 2 | business | Yukos unit buyer faces loan claim\n\nThe owner... | yuko unit buyer face loan claim owner embattl ... | yukos unit buyer face loan claim owner embattl... |
| 3 | business | High fuel prices hit BA's profits\n\nBritish A... | high fuel price hit ba's profit british airway... | high fuel price hit ba's profit british airway... |
| 4 | business | Pernod takeover talk lifts Domecq\n\nShares in... | pernod takeov talk lift domecq share uk drink ... | pernod takeover talk lift domecq share uk drin... |
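The stemmed and lemmatized columns above differ in forms such as `quarterli`/`quarterly` and `takeov`/`takeover`. A minimal sketch of where the stemmed forms come from, using only `PorterStemmer` (which needs no downloaded NLTK data); the word list is simply taken from the rows above for illustration:

```python
from nltk.stem import PorterStemmer

# Words taken from the first rows of the BBC data shown above.
words = ["quarterly", "takeover", "yukos"]

stemmer = PorterStemmer()
# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems are produced by suffix-stripping rules, so they are not always dictionary words. The lemmatizer keeps dictionary forms but depends on the WordNet corpus being available.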


In [11]:
bbc_df['text_stem_tokens'] = bbc_df['cleaned_text_stem'].apply(word_tokenize)
bbc_df['text_lemm_tokens'] = bbc_df['cleaned_text_lemm'].apply(word_tokenize)
bbc_df.head()

| | Category | text | cleaned_text_stem | cleaned_text_lemm | text_stem_tokens | text_lemm_tokens |
|---|---|---|---|---|---|---|
| 0 | business | Ad sales boost Time Warner profit\n\nQuarterly... | ad sale boost time warner profit quarterli pro... | ad sale boost time warner profit quarterly pro... | [ad, sale, boost, time, warner, profit, quarte... | [ad, sale, boost, time, warner, profit, quarte... |
| 1 | business | Dollar gains on Greenspan speech\n\nThe dollar... | dollar gain greenspan speech dollar hit highes... | dollar gain greenspan speech dollar hit highes... | [dollar, gain, greenspan, speech, dollar, hit,... | [dollar, gain, greenspan, speech, dollar, hit,... |
| 2 | business | Yukos unit buyer faces loan claim\n\nThe owner... | yuko unit buyer face loan claim owner embattl ... | yukos unit buyer face loan claim owner embattl... | [yuko, unit, buyer, face, loan, claim, owner, ... | [yukos, unit, buyer, face, loan, claim, owner,... |
| 3 | business | High fuel prices hit BA's profits\n\nBritish A... | high fuel price hit ba's profit british airway... | high fuel price hit ba's profit british airway... | [high, fuel, price, hit, ba, 's, profit, briti... | [high, fuel, price, hit, ba, 's, profit, briti... |
| 4 | business | Pernod takeover talk lifts Domecq\n\nShares in... | pernod takeov talk lift domecq share uk drink ... | pernod takeover talk lift domecq share uk drin... | [pernod, takeov, talk, lift, domecq, share, uk... | [pernod, takeover, talk, lift, domecq, share, ... |


## 2. DTM (Document Term Matrix) with TF-IDF

In [13]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [14]:
bbc_df_mod = bbc_df.copy()
bbc_df_mod.head()

| | Category | text | cleaned_text_stem | cleaned_text_lemm | text_stem_tokens | text_lemm_tokens |
|---|---|---|---|---|---|---|
| 0 | business | Ad sales boost Time Warner profit\n\nQuarterly... | ad sale boost time warner profit quarterli pro... | ad sale boost time warner profit quarterly pro... | [ad, sale, boost, time, warner, profit, quarte... | [ad, sale, boost, time, warner, profit, quarte... |
| 1 | business | Dollar gains on Greenspan speech\n\nThe dollar... | dollar gain greenspan speech dollar hit highes... | dollar gain greenspan speech dollar hit highes... | [dollar, gain, greenspan, speech, dollar, hit,... | [dollar, gain, greenspan, speech, dollar, hit,... |
| 2 | business | Yukos unit buyer faces loan claim\n\nThe owner... | yuko unit buyer face loan claim owner embattl ... | yukos unit buyer face loan claim owner embattl... | [yuko, unit, buyer, face, loan, claim, owner, ... | [yukos, unit, buyer, face, loan, claim, owner,... |
| 3 | business | High fuel prices hit BA's profits\n\nBritish A... | high fuel price hit ba's profit british airway... | high fuel price hit ba's profit british airway... | [high, fuel, price, hit, ba, 's, profit, briti... | [high, fuel, price, hit, ba, 's, profit, briti... |
| 4 | business | Pernod takeover talk lifts Domecq\n\nShares in... | pernod takeov talk lift domecq share uk drink ... | pernod takeover talk lift domecq share uk drin... | [pernod, takeov, talk, lift, domecq, share, uk... | [pernod, takeover, talk, lift, domecq, share, ... |


In [58]:
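def vectorize(vec, X_train_, X_test_):
    """
    Fits the vectorizer on the training texts only, then transforms both the
    training and the test texts with the fitted vocabulary and weights.
    """
    print("Vectorization Started......\n")
    X_train_vec = vec.fit_transform(X_train_)
    X_test_vec = vec.transform(X_test_)
    print("Vectorization completed.\n")
    return X_train_vec, X_test_vec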
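
To make the fit/transform split in `vectorize` concrete, here is a rough pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF and L2 normalisation that `TfidfVectorizer` applies by default, and it simplifies tokenization to whitespace splitting; the helper names `fit_idf` and `transform` and the toy documents are hypothetical, not part of the notebook.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Weight term counts by IDF and L2-normalise; terms unseen in training are dropped."""
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

# Toy stand-ins for cleaned training texts.
train_docs = ["dollar gain profit", "profit sale boost", "match goal win"]
idf = fit_idf(train_docs)

# "takeover" never appeared in training, so it contributes nothing to the row.
row = transform("profit profit takeover", idf)
print(row)
```

Because the vocabulary and IDF weights come from the training set alone, a test article is scored only on terms the model has already seen, which is why the test texts are passed through `transform` rather than `fit_transform`.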

**Encode Label**

In [22]:
enc = LabelEncoder()
bbc_df['Category'] = enc.fit_transform(bbc_df['Category'])
labels = enc.classes_
labels

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)
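
The integer codes follow the sorted order of the class names shown above. A small pure-Python sketch of that mapping, which imitates rather than calls `LabelEncoder`; the `categories` list is made-up sample data:

```python
# Made-up sample of category labels, in arbitrary order.
categories = ["sport", "business", "tech", "business", "politics", "entertainment"]

# Sorted unique labels play the role of enc.classes_.
classes = sorted(set(categories))
label_to_id = {label: idx for idx, label in enumerate(classes)}

# Encode to integers, then decode back by indexing into the class list.
encoded = [label_to_id[label] for label in categories]
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
```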

In [33]:
bbc_df.head()

| | Category | text | cleaned_text_stem | cleaned_text_lemm | text_stem_tokens | text_lemm_tokens |
|---|---|---|---|---|---|---|
| 0 | 0 | Ad sales boost Time Warner profit\n\nQuarterly... | ad sale boost time warner profit quarterli pro... | ad sale boost time warner profit quarterly pro... | [ad, sale, boost, time, warner, profit, quarte... | [ad, sale, boost, time, warner, profit, quarte... |
| 1 | 0 | Dollar gains on Greenspan speech\n\nThe dollar... | dollar gain greenspan speech dollar hit highes... | dollar gain greenspan speech dollar hit highes... | [dollar, gain, greenspan, speech, dollar, hit,... | [dollar, gain, greenspan, speech, dollar, hit,... |
| 2 | 0 | Yukos unit buyer faces loan claim\n\nThe owner... | yuko unit buyer face loan claim owner embattl ... | yukos unit buyer face loan claim owner embattl... | [yuko, unit, buyer, face, loan, claim, owner, ... | [yukos, unit, buyer, face, loan, claim, owner,... |
| 3 | 0 | High fuel prices hit BA's profits\n\nBritish A... | high fuel price hit ba's profit british airway... | high fuel price hit ba's profit british airway... | [high, fuel, price, hit, ba, 's, profit, briti... | [high, fuel, price, hit, ba, 's, profit, briti... |
| 4 | 0 | Pernod takeover talk lifts Domecq\n\nShares in... | pernod takeov talk lift domecq share uk drink ... | pernod takeover talk lift domecq share uk drin... | [pernod, takeov, talk, lift, domecq, share, uk... | [pernod, takeover, talk, lift, domecq, share, ... |


In [46]:
X_train_stem, X_test_stem, y_train_stem, y_test_stem = train_test_split(bbc_df['cleaned_text_stem'],
                                                                    bbc_df['Category'], test_size=0.2, random_state = 12)
len(X_train_stem), len(X_test_stem), len(y_train_stem), len(y_test_stem)

(1780, 445, 1780, 445)

In [47]:
X_train_lemm, X_test_lemm, y_train_lemm, y_test_lemm = train_test_split(bbc_df['cleaned_text_lemm'],
                                                                    bbc_df['Category'], test_size=0.2, random_state = 12)
len(X_train_lemm), len(X_test_lemm), len(y_train_lemm), len(y_test_lemm)

(1780, 445, 1780, 445)

In [59]:
X_train_stem_vec, X_test_stem_vec = vectorize(TfidfVectorizer(), X_train_stem, X_test_stem)
X_train_lemm_vec, X_test_lemm_vec = vectorize(TfidfVectorizer(), X_train_lemm, X_test_lemm)

Vectorization Started......

Vectorization completed.

Vectorization Started......

Vectorization completed.



## 3. Model Building

In [63]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [80]:
models = {'Naive_Bayes': MultinomialNB(),
          'Decision_Tree': DecisionTreeClassifier(),
          'Logistic_regression': LogisticRegression(),
          'Random_Forest': RandomForestClassifier(),
          'SVC': SVC()}

In [85]:
def evaluate_models(models_, mX_train, mX_test, my_train, my_test):
    """
    Takes models, features & labels as inputs and returns the metrics as outputs.
    For Classification models only.
    """
    evaluations = {}
    for name, model in models_.items():
        print('Model Building for: ', name)
        model.fit(mX_train, my_train)
        my_pred = model.predict(mX_test)
        acc = accuracy_score(my_test, my_pred)
        pre = precision_score(my_test, my_pred, average='weighted')
        rec = recall_score(my_test, my_pred, average='weighted')
        f1 = f1_score(my_test, my_pred, average='weighted')
        evaluations[name] = {'accuracy': acc,
                             'precision': pre,
                             'recall': rec,
                             'f1': f1}
    for model_name, metrics in evaluations.items():
        print('*'*10, model_name, '*'*10)
        print(metrics)
    return evaluations
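
`evaluate_models` uses `average='weighted'`, which averages the per-class scores weighted by the number of true articles in each class. The pure-Python sketch below reproduces that arithmetic on a made-up six-article example; `weighted_scores` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall) for two label lists."""
    support = Counter(y_true)      # true articles per class
    predicted = Counter(y_pred)    # predicted articles per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    accuracy = sum(correct.values()) / n
    # Per-class precision, weighted by each class's share of the true labels.
    precision = sum(
        support[c] / n * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    # Per-class recall with the same weights; the supports cancel out.
    recall = sum(support[c] / n * (correct[c] / support[c]) for c in support)
    return accuracy, precision, recall

# Made-up labels: one class-0 article is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
acc, pre, rec = weighted_scores(y_true, y_pred)
print(acc, pre, rec)
```

With these weights the class supports cancel in the recall sum, so weighted recall always equals accuracy. That is why the `recall` and `accuracy` values are identical for every model in the results that follow.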

**For Stemmed Data**

In [86]:
evals_stem = evaluate_models(models, X_train_stem_vec, X_test_stem_vec, y_train_stem, y_test_stem)

Model Building for:  Naive_Bayes
Model Building for:  Decision_Tree
Model Building for:  Logistic_regression
Model Building for:  Random_Forest
Model Building for:  SVC
********** Naive_Bayes **********
{'accuracy': 0.9617977528089887, 'precision': 0.9631049657955838, 'recall': 0.9617977528089887, 'f1': 0.9617182102059925}
********** Decision_Tree **********
{'accuracy': 0.8292134831460675, 'precision': 0.8347615209670171, 'recall': 0.8292134831460675, 'f1': 0.8291432335379542}
********** Logistic_regression **********
{'accuracy': 0.9752808988764045, 'precision': 0.9752766087923209, 'recall': 0.9752808988764045, 'f1': 0.9752319845293916}
********** Random_Forest **********
{'accuracy': 0.952808988764045, 'precision': 0.9542989959338058, 'recall': 0.952808988764045, 'f1': 0.9527667189067466}
********** SVC **********
{'accuracy': 0.9730337078651685, 'precision': 0.9733740386631895, 'recall': 0.9730337078651685, 'f1': 0.9730455477133734}


**For Lemmatized Data**

In [87]:
evals_lemm = evaluate_models(models, X_train_lemm_vec, X_test_lemm_vec, y_train_lemm, y_test_lemm)

Model Building for:  Naive_Bayes
Model Building for:  Decision_Tree
Model Building for:  Logistic_regression
Model Building for:  Random_Forest
Model Building for:  SVC
********** Naive_Bayes **********
{'accuracy': 0.9595505617977528, 'precision': 0.9612119883405167, 'recall': 0.9595505617977528, 'f1': 0.9594871295681345}
********** Decision_Tree **********
{'accuracy': 0.8224719101123595, 'precision': 0.8233665766628923, 'recall': 0.8224719101123595, 'f1': 0.8225640274466407}
********** Logistic_regression **********
{'accuracy': 0.9775280898876404, 'precision': 0.9775955787055953, 'recall': 0.9775280898876404, 'f1': 0.9774674474693473}
********** Random_Forest **********
{'accuracy': 0.9573033707865168, 'precision': 0.9580477308751878, 'recall': 0.9573033707865168, 'f1': 0.9571817585631225}
********** SVC **********
{'accuracy': 0.9752808988764045, 'precision': 0.9753693637321044, 'recall': 0.9752808988764045, 'f1': 0.9752354806260389}


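> **Observations:** Lemmatization is marginally ahead of stemming for logistic regression, random forest and SVC, and marginally behind for naive Bayes and the decision tree. Every gap is between one and three articles of the 445 in the test set, so on this split the two approaches perform about the same.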

In [97]:
# Map each encoded integer back to its category name
label_decoder = dict(enumerate(enc.classes_))

> **Testing a few articles manually**

In [134]:
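def predict_article_class(pmodel, X):
    # Rebuild the TF-IDF vocabulary from the lemmatized training texts
    vec = TfidfVectorizer()
    vec.fit(X_train_lemm)
    # Clean the article the same way as the training data (lemmatized, not stemmed)
    X_clean = pd.Series(clean_corpus(X, lemmatize=True))
    X_clean_vec = vec.transform(X_clean)
    y_pred = pmodel.predict(X_clean_vec)
    print("Predicted Label:", label_decoder[y_pred[0]])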
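
One way to avoid re-fitting the vectorizer on every call is to bundle it with the classifier. The sketch below is an alternative arrangement, not the notebook's method: it uses scikit-learn's `make_pipeline` on toy stand-in texts, and in the notebook the inputs would first go through `clean_corpus` with the same `lemmatize` setting used for training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the cleaned training texts and their labels.
train_texts = [
    "dollar gain profit share market",
    "profit sale boost share bank",
    "match goal win team coach",
    "team win cup match player",
]
train_labels = ["business", "business", "sport", "sport"]

# The pipeline fits the vectorizer and classifier together, so prediction
# always reuses the vocabulary and IDF weights learned during training.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# "pick" was never seen in training and is ignored by the vectorizer.
predicted = pipeline.predict(["coach pick player cup match"])
print(predicted[0])
```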

In [106]:
X_test_stem

1329    call kenteri clear kosta kenteri lawyer call d...
91      iranian mp threaten mobil deal turkey's bigges...
733     farrel due make us tv debut actor colin farrel...
1430    parri relish anfield challeng bbc sport reflec...
784     produc scoop stage award produc beaten mari po...
                              ...                        
251     bt offer equal access rival bt move pre-empt p...
1262    uk head wrong way say howard toni blair chanc ...
1871    'no re-draft eu patent law propos european law...
1006    kilroy launch'verita parti ex-bbc chat show ho...
821     smith lose us box offic crown new comedi diari...
Name: cleaned_text_stem, Length: 445, dtype: object

In [111]:
sample_test_indices = [1329,91,733,1430,784,1262]

In [114]:
test_classifier = LogisticRegression()
test_classifier.fit(X_train_lemm_vec, y_train_lemm)

LogisticRegression()

**Test 1**

In [115]:
bbc_df['text'].iloc[sample_test_indices[0]]

'Call for Kenteris to be cleared\n\nKostas Kenteris\' lawyer has called for the doping charges against the Greek sprinter to be dropped.\n\nGregory Ioannidis has submitted new evidence to a Greek athletics tribunal which he claims proves the former Olympic champion has no case to answer. Kenteris and compatriot Katerina Thanou were given provisional suspensions in December for failing to take drugs tests before the Athens Olympics. The Greek tribunal is expected to give its verdict early next week. Kenteris and Thanou withdrew from the Athens Olympics last August after missing drugs tests on the eve of the opening ceremony. They were also alleged to have avoided tests in Tel Aviv and Chicago before the Games.\n\nBut Ioannidis said: "Everything overwhelmingly shows that the charges should be dropped." Ioannidis also said he has presented evidence that will throw a different light on the events leading up to the pair\'s sensational withdrawal from the Athens Games. The lawyer added that 

In [117]:
print("True Label: ", label_decoder[bbc_df['Category'].iloc[sample_test_indices[0]]])

True Label:  sport


In [135]:
predict_article_class(test_classifier, bbc_df['text'][sample_test_indices[0]])

Predicted Label: sport


**Test 2**

In [138]:
bbc_df['text'].iloc[sample_test_indices[1]]

'Iranian MPs threaten mobile deal\n\nTurkey\'s biggest private mobile firm could bail out of a $3bn ($1.6bn) deal to build a network in Iran after MPs there slashed its stake in the project.\n\nConservatives in parliament say Turkcell\'s stake in Irancell, the new network, should be cut from 70% to 49%. They have already given themselves a veto over all foreign investment deals, following allegations about Turkish firms\' involvement in Israel. Turkcell now says it may give up on the deal altogether.\n\nIran currently has only one heavily congested mobile network, with long waiting lists for new subscribers. Turkcell signed a contract for the new network in September. The new operator planned to offer subscriptions for about $180, well below the existing firm\'s $500 price tag. But a parliamentary commission has now ruled that Turkcell\'s 70% controlling stake is too high. They say that Turkcell is a security risk because of alleged business ties with Israel. Parliament as a whole - do

In [139]:
print("True Label: ", label_decoder[bbc_df['Category'].iloc[sample_test_indices[1]]])

True Label:  business


In [140]:
predict_article_class(test_classifier, bbc_df['text'][sample_test_indices[1]])

Predicted Label: business


**Test 3**

In [142]:
bbc_df['text'].iloc[sample_test_indices[2]]

"Farrell due to make US TV debut\n\nActor Colin Farrell is to make his debut on US television in medical sitcom Scrubs, according to Hollywood newspaper Daily Variety.\n\nThe film star, who recently played the title role in historical blockbuster Alexander, will make a cameo appearance as an unruly Irishman. The episode featuring the 28-year-old will be screened on 25 January. Farrell's appearance is said to be a result of his friendship with Zach Braff, who stars in the programme. It will be the actor's first appearance on the small screen since he appeared in BBC series Ballykissangel in 1999. The gentle Sunday night drama came to an end in 2001.\n\nHe has since become one of Hollywood's fastest-rising stars, with a string roles in major league films such as Minority Report, Phone Booth and Daredevil. Farrell is pencilled in to play the role of Crockett in a film version of 1980s police drama Miami Vice. Scrubs, which appears on the NBC network in the US and has been shown on Channel

In [143]:
print("True Label: ", label_decoder[bbc_df['Category'].iloc[sample_test_indices[2]]])

True Label:  entertainment


In [144]:
predict_article_class(test_classifier, bbc_df['text'][sample_test_indices[2]])

Predicted Label: entertainment


**Test 4**

In [146]:
bbc_df['text'].iloc[sample_test_indices[3]]

'Parry relishes Anfield challenge\n\nBBC Sport reflects on the future for Liverpool after our exclusive interview with chief executive Rick Parry.\n\nChief executive Parry is the man at the helm as Liverpool reach the most crucial point in their recent history. Parry has to deliver a new 60,000-seat stadium in Stanley Park by 2007 amid claims of costs spiralling above £120m. He is also searching for an investment package of a size and stature that will restore Liverpool to their place at European football\'s top table. But it is a challenge that appears to sit easily with Parry, who has forged a reputation as one of football\'s most respected administrators since his days at the fledgling Premier League.\n\nLiverpool have not won the championship since 1990, a fact that causes deep discomfort inside Anfield as they attempt to muscle in on the top three of Chelsea, Manchester United and Arsenal. Throw in the small matter of warding off every top club in world football as they eye capta

In [147]:
print("True Label: ", label_decoder[bbc_df['Category'].iloc[sample_test_indices[3]]])

True Label:  sport


In [148]:
predict_article_class(test_classifier, bbc_df['text'][sample_test_indices[3]])

Predicted Label: sport


**Test 5**

In [150]:
bbc_df['text'].iloc[sample_test_indices[4]]

"The Producers scoops stage awards\n\nThe Producers has beaten Mary Poppins in the battle of the blockbuster West End musicals at the Olivier Awards.\n\nThe Producers won three prizes at the UK's most prestigious annual theatre awards, while Mary Poppins won two. Mel Brooks' hit show triumphed in the battle for best new musical, where it was up against Mary Poppins and Andrew Lloyd Webber's The Woman in White. Alan Bennett's The History Boys was the big winner in the straight theatre categories, picking up three trophies. But all eyes were on the musical prizes after The Producers, Mary Poppins and The Woman in White all had high-profile openings in the last six months.\n\nThe Producers' Nathan Lane, a last-minute replacement for Richard Dreyfuss, beat his former co-star Lee Evans to win best musical actor. Lane has already left the production. A smash hit on Broadway before moving to London, the show also won best musical performance in a supporting role for Conleth Hill, who plays di

In [152]:
print("True Label: ", label_decoder[bbc_df['Category'].iloc[sample_test_indices[4]]])

True Label:  entertainment


In [153]:
predict_article_class(test_classifier, bbc_df['text'][sample_test_indices[4]])

Predicted Label: entertainment


**Test 6**

In [156]:
bbc_df['text'].iloc[sample_test_indices[5]]

'UK heading wrong way, says Howard\n\nTony Blair has had the chance to tackle the problems facing Britain and has failed, Michael Howard has said.\n\n"Britain is heading in the wrong direction", the Conservative leader said in his New Year message. Mr Blair\'s government was a "bossy, interfering government that takes decisions that should be made by individuals," he added. But Labour\'s campaign spokesman Fraser Kemp responded: "Britain is working, don\'t let the Tories wreck it again". Mr Howard also paid tribute to the nation\'s character for its generous response to the Asian quake disaster. The catastrophe was overshadowing the hopes for the future at this usually positive time of the year, Mr Howard said.\n\n"We watched the scenes of destruction with a sense of disbelief. The scale, the speed, the ferocity of what happened on Boxing Day is difficult to grasp. "Yet Britain\'s response has shone a light on our nation\'s character. The last week has shown that the warm, caring heart

In [158]:
print("True Label: ", label_decoder[bbc_df['Category'].iloc[sample_test_indices[5]]])

True Label:  politics


In [159]:
predict_article_class(test_classifier, bbc_df['text'][sample_test_indices[5]])

Predicted Label: politics
