### IMPORTING LIBRARIES

In [87]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kshitijranjan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [109]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from gensim.models import Word2Vec

#### INPUT IS THE PRESIDENTIAL SPEECH FROM Dr. APJ Abdul Kalam

In [89]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

##### CREATING SENTENCES FROM THE PARAGRAPH AND INITIALISING STEMMER AND LEMMATISER FUNCTION

#### Stemming
Stemming reduces a word to its root form. The result may or may not be a valid dictionary word. It is mainly used in tasks such as sentiment analysis, where the root matters more than the exact word form.
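To make the point about non-dictionary roots concrete, here is a minimal sketch that stems a few words taken from the speech. The word list is chosen only for illustration; `PorterStemmer` needs no corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the speech; the stems need not be dictionary words.
words = ["history", "people", "invaded", "captured", "conquered"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as `histori` and `peopl` are not English words, which is usually acceptable when the stems only serve as features for a model.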

In [90]:
sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()

In [91]:
### Stemming
stop_words = set(stopwords.words('english'))  # build the stop word set once
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
#### After removing the stop words and replacing with the stem words
print(sentences)

['i three vision india .', 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .', 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .', 'yet done nation .', 'we conquer anyon .', 'we grab land , cultur , histori tri enforc way life .', 'whi ?', 'becaus respect freedom others.that first vision freedom .', 'i believ india got first vision 1857 , start war independ .', 'it freedom must protect nurtur build .', 'if free , one respect us .', 'my second vision india ’ develop .', 'for fifti year develop nation .', 'it time see develop nation .', 'we among top 5 nation world term gdp .', 'we 10 percent growth rate area .', 'our poverti level fall .', 'our achiev global recognis today .', 'yet lack self-confid see develop nation , self-reli self-assur .', 'isn ’ incorrect ?', 'i third vision .', 'india must stand world .', 'becaus i believ unless india stand world , one respect us .', 'onli strength respect stre

#### Lemmatization
Lemmatization reduces a word to its root (the lemma), which is always a valid dictionary word. It is useful when the meaning of the text matters, for example when interpreting questionnaire responses. It is slower than stemming because each word is looked up in a language dictionary (WordNet here).

In [92]:
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()

In [93]:
### Lemmatization
stop_words = set(stopwords.words('english'))  # build the stop word set once
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)
#### After removing the stop words and replacing with the lemmatised words
print(sentences)

['I three vision India .', 'In 3000 year history , people world come invaded u , captured land , conquered mind .', 'From Alexander onwards , Greeks , Turks , Moguls , Portuguese , British , French , Dutch , came looted u , took .', 'Yet done nation .', 'We conquered anyone .', 'We grabbed land , culture , history tried enforce way life .', 'Why ?', 'Because respect freedom others.That first vision freedom .', 'I believe India got first vision 1857 , started War Independence .', 'It freedom must protect nurture build .', 'If free , one respect u .', 'My second vision India ’ development .', 'For fifty year developing nation .', 'It time see developed nation .', 'We among top 5 nation world term GDP .', 'We 10 percent growth rate area .', 'Our poverty level falling .', 'Our achievement globally recognised today .', 'Yet lack self-confidence see developed nation , self-reliant self-assured .', 'Isn ’ incorrect ?', 'I third vision .', 'India must stand world .', 'Because I believe unless 

### BAG OF WORDS


#### Cleaning of text

In [118]:
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
corpus = []

In [119]:
stop_words = set(stopwords.words('english'))  # build the stop word set once
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]'," ",sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)

['i', 'have', 'three', 'visions', 'for', 'india']
['in', 'years', 'of', 'our', 'history', 'people', 'from', 'all', 'over', 'the', 'world', 'have', 'come', 'and', 'invaded', 'us', 'captured', 'our', 'lands', 'conquered', 'our', 'minds']
['from', 'alexander', 'onwards', 'the', 'greeks', 'the', 'turks', 'the', 'moguls', 'the', 'portuguese', 'the', 'british', 'the', 'french', 'the', 'dutch', 'all', 'of', 'them', 'came', 'and', 'looted', 'us', 'took', 'over', 'what', 'was', 'ours']
['yet', 'we', 'have', 'not', 'done', 'this', 'to', 'any', 'other', 'nation']
['we', 'have', 'not', 'conquered', 'anyone']
['we', 'have', 'not', 'grabbed', 'their', 'land', 'their', 'culture', 'their', 'history', 'and', 'tried', 'to', 'enforce', 'our', 'way', 'of', 'life', 'on', 'them']
['why']
['because', 'we', 'respect', 'the', 'freedom', 'of', 'others', 'that', 'is', 'why', 'my', 'first', 'vision', 'is', 'that', 'of', 'freedom']
['i', 'believe', 'that', 'india', 'got', 'its', 'first', 'vision', 'of', 'this', 'i

#### Creating the Bag of Words model

In [96]:
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()

In [97]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
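Each row of `X` corresponds to one sentence and each column to one vocabulary word. The following pure-Python sketch builds the same kind of matrix by hand on a hypothetical two-sentence corpus (`toy_corpus` is invented for illustration and is not part of the speech pipeline).

```python
from collections import Counter

# Hypothetical two-sentence corpus, already cleaned and lower-cased.
toy_corpus = ["freedom vision freedom", "vision india"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})

# One row per sentence, one column per vocabulary word, each cell a count.
bow = []
for doc in toy_corpus:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

`CountVectorizer` also orders its columns alphabetically, and `max_features` keeps only the most frequent terms. In recent scikit-learn versions, `cv.get_feature_names_out()` should list the word behind each column.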

### TF-IDF

#### Term Frequency
TF = (Number of times the word occurs in the sentence) / (Number of words in the sentence)
#### Inverse Document Frequency
IDF = log((Number of sentences) / (Number of sentences containing the word))

TF-IDF = TF * IDF
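As a worked example of these definitions, the sketch below computes TF, IDF (taken as the log of the ratio of total sentences to sentences containing the word) and their product on a hypothetical three-sentence corpus. The helper names and `toy_corpus` are assumptions made for illustration.

```python
import math

# Hypothetical toy corpus of three cleaned, tokenized sentences.
toy_corpus = [
    ["freedom", "vision", "freedom"],
    ["vision", "india"],
    ["india", "nation", "india"],
]

def term_frequency(word, sentence):
    # Share of the sentence's words that are `word`.
    return sentence.count(word) / len(sentence)

def inverse_document_frequency(word, corpus):
    # Assumes `word` occurs in at least one sentence of the corpus.
    containing = sum(1 for sentence in corpus if word in sentence)
    return math.log(len(corpus) / containing)

def tf_idf(word, sentence, corpus):
    return term_frequency(word, sentence) * inverse_document_frequency(word, corpus)

score_freedom = tf_idf("freedom", toy_corpus[0], toy_corpus)
score_vision = tf_idf("vision", toy_corpus[0], toy_corpus)
print(score_freedom, score_vision)
```

"freedom" scores higher than "vision" in the first sentence because it occurs twice there and in no other sentence. Note that `TfidfVectorizer` does not use exactly this formula: with its default settings it uses raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ from this sketch.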

In [98]:
### Cleaning the text and initializing lemmatizer
sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stop word set once
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]'," ",sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)


##### Creating TFIDF Model

In [99]:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()

In [100]:
X

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.25883507, 0.30512561,
        0.        ],
       [0.        , 0.28867513, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

## SPAM SMS CLASSIFIER

In [101]:
### Setting the working directory to the sibling 'input' folder
os.chdir(os.path.join(os.path.dirname(os.getcwd()), 'input'))
os.getcwd()

### Reading the Spam Input
messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep = '\t', names = ["label","message"])
messages.head(20)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


### Cleaning the Dataset

In [102]:
### Cleaning the text and initializing lemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stop word set once
corpus = []
for i in range(len(messages)):
    review = re.sub('[^a-zA-Z]'," ",messages['message'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)
### Assign the whole column at once; chained indexing (messages['Clean'][i] = ...) can trigger SettingWithCopyWarning
messages['Clean'] = corpus

### Creating a Bag of Words

In [103]:
### Creating Bag of Words for the Independent Variable
cv = CountVectorizer(max_features=5000)
X = cv.fit_transform(corpus).toarray()

In [104]:
### Creating array of Dependent Variable
### get_dummies yields one column per label ('ham', 'spam'); column 1 is the 'spam' indicator
y = pd.get_dummies(messages['label'])
y = y.iloc[:,1].values

###### Splitting the dataset into Test and Train

In [105]:
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.20,random_state=0)

##### Fitting the Multinomial NB model on the Train data and predicting on Test Data

In [106]:
spam_detect_model = MultinomialNB().fit(X_train,Y_train)
Y_pred = spam_detect_model.predict(X_test)
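In the cells above, `CountVectorizer` is fitted on the full corpus before the split, so the vocabulary is learned from the test messages as well. One possible way to keep vectorizing and fitting together is scikit-learn's `make_pipeline`, sketched here on a tiny hypothetical dataset (the messages and labels below are invented, not taken from the SMS collection).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned messages and labels (1 = spam, 0 = ham).
train_messages = [
    "free entry win cup final",
    "winner claim free prize",
    "ok see later",
    "dinner home tonight",
]
train_labels = [1, 1, 0, 0]

# The vectorizer is fitted only on the messages passed to fit(),
# so vocabulary from held-out messages never leaks into training.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_messages, train_labels)

predictions = spam_pipeline.predict(["claim free prize", "see home later"])
print(predictions)
```

With this layout the split would be applied to the cleaned text rather than to the already vectorized matrix, and the pipeline would be fitted on the training portion only.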

###### Creating Confusion matrix and checking accuracy

In [107]:
conf_m = confusion_matrix(Y_test,Y_pred)
accuracy = accuracy_score(Y_test,Y_pred)
accuracy

0.9820627802690582
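Accuracy can be read directly off the confusion matrix, and the same matrix also gives precision and recall, which are often more informative when spam is the minority class. The counts below are hypothetical and are not the `conf_m` produced by this notebook.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true labels, columns are predicted labels, order [ham, spam].
conf_m_example = [
    [90, 2],  # true ham:  90 predicted ham, 2 predicted spam
    [3, 5],   # true spam: 3 predicted ham, 5 predicted spam
]

tn, fp = conf_m_example[0]
fn, tp = conf_m_example[1]
total = tn + fp + fn + tp

accuracy_example = (tn + tp) / total  # share of all messages classified correctly
precision = tp / (tp + fp)            # share of predicted spam that is really spam
recall = tp / (tp + fn)               # share of real spam that was caught

print(accuracy_example, precision, recall)
```

In this invented example accuracy is 0.95 even though only 5 of the 8 spam messages are caught, which is why it can be worth inspecting `conf_m` alongside the accuracy score.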

### WORD2VEC
Neither BOW nor TF-IDF stores semantic information. TF-IDF also gives more weight to words that are uncommon in the corpus.

*This increases the chance of over-fitting*

In word2vec
-  Each word is represented as a vector of 32 or more dimensions instead of a single number
-  Semantic information and the relations between different words are preserved
-  Similar words lie close to each other in the vector space
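Closeness in the vector space is usually measured with cosine similarity, which is also what gensim's `most_similar` ranks by. The sketch below uses hypothetical 3-dimensional vectors (real models use many more dimensions; gensim's `Word2Vec` defaults to 100) purely to show the computation.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: 1 means same direction, 0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only.
toy_vectors = {
    "freedom": [0.9, 0.8, 0.1],
    "independence": [0.85, 0.75, 0.2],
    "poverty": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["freedom"], toy_vectors["independence"])
sim_far = cosine_similarity(toy_vectors["freedom"], toy_vectors["poverty"])
print(sim_close, sim_far)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score much lower. On a corpus as small as a single speech, the learned vectors are unlikely to be this well separated.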

In [122]:
### Pre processing the paragraph
text = re.sub(r'\[[0-9]*\]'," ",paragraph)
text = re.sub(r'\s+'," ",text)
text = text.lower()
text = re.sub(r'\d'," ",text)
text = re.sub(r'\s+'," ",text)
### Converting paragraph into sentences
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
### Removing stop words from the sentences
stop_words = set(stopwords.words('english'))  # build the stop word set once
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]

[['three', 'visions', 'india', '.'], ['years', 'history', ',', 'people', 'world', 'come', 'invaded', 'us', ',', 'captured', 'lands', ',', 'conquered', 'minds', '.'], ['alexander', 'onwards', ',', 'greeks', ',', 'turks', ',', 'moguls', ',', 'portuguese', ',', 'british', ',', 'french', ',', 'dutch', ',', 'came', 'looted', 'us', ',', 'took', '.'], ['yet', 'done', 'nation', '.'], ['conquered', 'anyone', '.'], ['grabbed', 'land', ',', 'culture', ',', 'history', 'tried', 'enforce', 'way', 'life', '.'], ['?'], ['respect', 'freedom', 'others.that', 'first', 'vision', 'freedom', '.'], ['believe', 'india', 'got', 'first', 'vision', ',', 'started', 'war', 'independence', '.'], ['freedom', 'must', 'protect', 'nurture', 'build', '.'], ['free', ',', 'one', 'respect', 'us', '.'], ['second', 'vision', 'india', '’', 'development', '.'], ['fifty', 'years', 'developing', 'nation', '.'], ['time', 'see', 'developed', 'nation', '.'], ['among', 'top', 'nations', 'world', 'terms', 'gdp', '.'], ['percent', 'gr

#### Building the Word2Vec model

In [None]:
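#### Creating the model object
#### min_count=1 keeps every word, even those that occur only once
model = Word2Vec(sentences,min_count=1)
#### Embedding matrix: one row per vocabulary word
words = model.wv.vectors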

In [129]:
### Words similar to vikram
similar = model.wv.most_similar('vikram')
print('Word similar to vikram is: ',similar)

Word similar to vikram is:  [('visions', 0.18146507441997528), ('growth', 0.1663494110107422), ('one', 0.1643451601266861), ('took', 0.16432958841323853), (',', 0.15887504816055298), ('fifty', 0.1472669392824173), ('developing', 0.14714017510414124), ('worked', 0.13810548186302185), ('development', 0.1376984417438507), ('time', 0.13293510675430298)]


## Stock Sentiment Analysis

In [131]:
### Reading the Stock sentiment Input
data = pd.read_csv('Stock_data.csv')
data.head(20)

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite
2,2000-01-05,0,Coventry caught on counter by Flo,United's rivals on the road to Rio,Thatcher issues defence before trial by video,Police help Smith lay down the law at Everton,Tale of Trautmann bears two more retellings,England on the rack,Pakistan retaliate with call for video of Walsh,Cullinan continues his Cape monopoly,...,South Melbourne (Australia),Necaxa (Mexico),Real Madrid (Spain),Raja Casablanca (Morocco),Corinthians (Brazil),Tony's pet project,Al Nassr (Saudi Arabia),Ideal Holmes show,Pinochet leaves hospital after tests,Useful links
3,2000-01-06,1,Pilgrim knows how to progress,Thatcher facing ban,McIlroy calls for Irish fighting spirit,Leicester bin stadium blueprint,United braced for Mexican wave,"Auntie back in fashion, even if the dress look...",Shoaib appeal goes to the top,Hussain hurt by 'shambles' but lays blame on e...,...,Putin admits Yeltsin quit to give him a head s...,BBC worst hit as digital TV begins to bite,How much can you pay for...,Christmas glitches,"Upending a table, Chopping a line and Scoring ...","Scientific evidence 'unreliable', defence claims",Fusco wins judicial review in extradition case,Rebels thwart Russian advance,Blair orders shake-up of failing NHS,Lessons of law's hard heart
4,2000-01-07,1,Hitches and Horlocks,Beckham off but United survive,Breast cancer screening,Alan Parker,Guardian readers: are you all whingers?,Hollywood Beyond,Ashes and diamonds,Whingers - a formidable minority,...,Most everywhere: UDIs,Most wanted: Chloe lunettes,Return of the cane 'completely off the agenda',From Sleepy Hollow to Greeneland,Blunkett outlines vision for over 11s,"Embattled Dobson attacks 'play now, pay later'...",Doom and the Dome,What is the north-south divide?,Aitken released from jail,Gone aloft
5,2000-01-10,1,Fifth round draw,BBC unveils secret weapon in ratings war: Sout...,Second Division round-up,European round-up,Third Division round-up,Welfare could claim Killie tie as Caley Thistl...,Ferguson puts brave face on Rio meltdown,Southgate in striking form to pre-empt penalties,...,Time Warner and AOL to merge,Keep up,Waging global war,"Desktop icons, No 1: The Qwerty keyboard",The sec's files,The low down: Workplace bullying,Met 'not equipped' to solve murders,Tranmere tie will not be replayed,Rebel attacks take toll on Russia,Met lent stopped car to Lawrence
6,2000-01-11,1,Man Utd 2 - 0 South Melbourne,How North Atlantic drift could carry away Old ...,Buoyant BBC to show Brazil final live,Tranmere given all-clear in the Cup,United sit poorly with the Doc,Queen's Park peril clouds Hampden future,Waugh hits out at Shoaib reprieve,Knight makes case for Butcher's place,...,I'd like that in writing,Split vote may offer NatWest takeover escape,"Teaching is for stayers, not sprinters",A lesson in respect,'Now everyone knows we are a good school','What's wrong with giving teachers applause?','When I realised I'd won I felt sick with shock',No more 'tenderness' from stung Russian forces,Inspectors warn pressure might lead officers t...,Repairing Jack's house
7,2000-01-12,0,Newcastle seek new football supremo,Liverpool aim to speed up Heskey deal,Highlanders voted up,Edwards' power play suffers new blow,Chelsea gamble on Weah,Taylor settles the eternal tie,Tenth top-flight club falls as Hodge has final...,Charlton charge to top,...,Like-minded,Is it cos I is black?,"Megabucks, Out of luck, and What the...?",Cabinet battle rages over ethical foreign policy,Radio station becomes Talk of sport,A better breed of dad,Childish things,Kids: just say hopscotch,Smoke without fire,"Press reaction from Spain, Chile and Argentina"
8,2000-01-13,1,Bungling officials on the carpet,And in the red and raw corner it's 'Killer' Ma...,United put their shirts on £30m,England against plan for home nations' revival,Donald poised to quit Test scene,Adams stares into the abyss,Money money money,Tyson to enter Britain,...,"Dinner plates, Microwave ovens and Toast",No one can be simultaneously free yet live in ...,More doubt on Pakistan arms exports,Another fine mess,Cybershopping: sportswear,How much can you pay for . . .,Hillary holds her own on The Late Show,"Bye, bye American pi",Tension mounts as Straw stands by trial plans,Harrods loses Prince Philip's royal warrant
9,2000-01-14,1,Pompey plump for Pulis work ethic,Roma under fire over Rolexes for referees,Prenton Park Two told to take a break,"OK, I didn't figure in Rio but I'm still No1 w...",Chelsea tune in to Weah's world,Top storey awaits the Cottage,West Indies unveil fiery next generation,Donald's killing fields await edgy England,...,Schreiber: The man who would topple kings,Joe Ashton's letter of resignation,Ashton resigns from Wednesday board,Stravinsky: The Rake's Progress,The best waterfront scene,"Incompetence, Insult and Injury",Finding the time,England take six on rain-shortened first day,Media sale nets Chris Evans a £75m 'divvy',Scissors at dawn
