# Generic vs Customised Algorithm

> A generic text mining algorithm might not work well on all types of data.

> We need to tailor the model, the dictionary, and the DTM to the data to make the model accurate.

> Generic parameters tend to give low accuracy.

# Movie review data

In [1]:
#importing the libraries
import numpy as np
import pandas as pd

In [2]:
input_data = pd.read_csv('User_movie_review.csv')

In [3]:
input_data.head()

Unnamed: 0,class,text
0,Pos,stuart little is one of the best family ...
1,Neg,a movie like mortal kombat annihilation wor...
2,Neg,and just when you thought joblo was getting a...
3,Pos,every now and then a movie comes along from a...
4,Neg,for about twenty minutes into mission impossi...


In [4]:
input_data.columns

Index(['class', 'text'], dtype='object')

In [5]:
# displaying the 10 full reviews
for i in range(10):
    print(input_data['text'][i])
    print('\n')
    

    stuart little   is one of the best family films to come out this year    it s a cute   funny and very good natured film that has nothing for parents to squirm over except a few mild cusswords    though i read the book a long time ago and i really do not remember what it was about   i do know that this film does not disappoint    finally a movie gets released that is as good as the trailer makes it to be   with a few surprising twists   some very funny moments   and a few sentimental moments all mixed in to one great little movie    stuart little is a mouse    he has finally gotten a new home after being put up for adoption   he now lives with the littles    a nice little   no pun intended   family that lives in their apartment next to central park in new york city    they have a little boy george   played by the adorable jonathan lipnicki   and now they have a new son    at first stuart takes a while but he finally adjusts to being part of the family and even getting along with the

In [6]:
# details of the data

In [7]:
input_data.shape

(2000, 2)

In [8]:
input_data.columns

Index(['class', 'text'], dtype='object')

In [9]:
# frequency of sentiments
input_data['class'].value_counts()

Pos    1000
Neg    1000
Name: class, dtype: int64

In [10]:
# check how many positive and negative labelled texts there are in the dataset
input_data.groupby('class').count()

Unnamed: 0_level_0,text
class,Unnamed: 1_level_1
Neg,1000
Pos,1000


In [11]:
import seaborn as sns

# Creating the Document Term Matrix (DTM)

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
countvec1 = CountVectorizer()
dtm_v1 = pd.DataFrame(countvec1.fit_transform(input_data['text']).toarray(), columns = countvec1.get_feature_names(), index = None)
dtm_v1['class'] = input_data['class']

In [31]:
dtm_v1.head()

Unnamed: 0,00,000,0009f,007,00s,03,04,05,05425,10,...,zukovsky,zulu,zundel,zurg,zus,zweibel,zwick,zwigoff,zycie,zzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## What exactly is a Document Term Matrix (DTM)?

A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

> I like to think of the Document Term Matrix (DTM) as an implementation of the Bag of Words concept. The DTM tracks the frequency of each term in each document.

> Term count is a common metric to use in a Document Term Matrix, but it is not the only one. In a future post I will discuss some common functions used to calculate term frequencies, but for this post I will use term counts.

> For example, the word "intelligent" occurs twice in Doc 1, once in Doc 2, and not at all in Doc 3. A Document Term Matrix can become a very large, sparse matrix, depending on the number of documents in the corpus and the number of terms in each document.

> The DTM representation is a fairly simple way to represent the documents as a numeric structure.
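
To make the rows-by-terms layout concrete, here is a minimal, library-free sketch. The three toy documents are hypothetical (they are not taken from the review dataset) and the tokenisation is a naive whitespace split, so treat it as an illustration of the idea rather than of what `CountVectorizer` does internally.

```python
from collections import Counter

# Three toy documents (hypothetical, not from the review dataset)
docs = [
    "intelligent people like intelligent films",
    "an intelligent film",
    "a dull film",
]

# Count the terms of each document using a naive whitespace split
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all terms; each term becomes one column
vocab = sorted(set(term for c in counts for term in c))

# One row per document, one column per term (a Counter returns 0 for absent terms)
dtm_toy = [[c[term] for term in vocab] for c in counts]

# Read off the column for the term "intelligent"
col = vocab.index("intelligent")
intelligent_counts = [row[col] for row in dtm_toy]
print(vocab)
print(intelligent_counts)
```

Most entries of `dtm_toy` are zero even for this tiny corpus, which is why a real DTM is usually stored as a sparse matrix.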

### Some points to take care of
> So far we have not removed the words that carry no sentiment and should not be part of a sentiment analysis. For example, the sentence "i love python and python language is very good" can be reduced to "love python python language very good". Words like "i", "and", and "is" should not be counted when calculating the Document Term Matrix.
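
As a rough illustration of the example sentence, the sketch below filters out a tiny hand-written stop word set. The set is an assumption made for this example only; NLTK's English list is much longer and also contains words such as "very". Note that both occurrences of "python" are kept, because stop word removal does not deduplicate.

```python
# Hypothetical minimal stop word set, for illustration only
stop_words = {"i", "and", "is"}

sentence = "i love python and python language is very good"

# Keep every word that is not a stop word, preserving order and repeats
kept = [word for word in sentence.split() if word not in stop_words]
filtered = " ".join(kept)
print(filtered)
```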

In [32]:
# Importing the important libraries
import pandas as pd
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

## Refining The Document Term Matrix

> The final result depends on how much you have refined your Document Term Matrix (DTM).

> Things we can do:
>
> - *Removing numbers*
> - *Removing punctuation*
> - *Removing stopwords*
> - *Stemming*

In [33]:
# Creating a function that cleans the text and splits it into tokens,
# so that only the informative words are kept
stemmer = PorterStemmer()
def tokenise(text):
    # NOTE: stemmer.stem() expects a single word. Called on a whole review, it treats
    # the string as one word: it lowercases the text and stems only its ending.
    text = stemmer.stem(text)
    text = re.sub(r'\W+|\d+|_', ' ', text) # this line replaces all punctuation, digits, and underscores with spaces
    tokens = nltk.word_tokenize(text)
    return tokens
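
`PorterStemmer.stem()` works on one word at a time, so stemming is normally applied to each token after the text has been split. The following is a minimal sketch of that per-token variant. The name `tokenise_per_word` is hypothetical, and a plain whitespace split stands in for `nltk.word_tokenize` so that no tokenizer data has to be downloaded.

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def tokenise_per_word(text):
    # Hypothetical variant of tokenise(): clean the text first, then stem each token
    text = re.sub(r'\W+|\d+|_', ' ', text)  # punctuation, digits, underscores -> spaces
    tokens = text.split()  # plain whitespace split (assumption: good enough after cleaning)
    return [stemmer.stem(token) for token in tokens]

tokens = tokenise_per_word("Very good story and wonderful presentation 2019!")
print(tokens)
```

Swapping this function into `CountVectorizer` would change the vocabulary, so the matrix shape and accuracy recorded in this notebook would no longer apply as shown.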

In [34]:
countvec = CountVectorizer(min_df = 5, tokenizer = tokenise, stop_words = stopwords.words('english'))
dtm = pd.DataFrame(countvec.fit_transform(input_data['text']).toarray(), columns = countvec.get_feature_names(), index = None)
dtm['class'] = input_data['class']

In [35]:
type(input_data['text'])

pandas.core.series.Series

In [36]:
print(input_data['text'][0])

    stuart little   is one of the best family films to come out this year    it s a cute   funny and very good natured film that has nothing for parents to squirm over except a few mild cusswords    though i read the book a long time ago and i really do not remember what it was about   i do know that this film does not disappoint    finally a movie gets released that is as good as the trailer makes it to be   with a few surprising twists   some very funny moments   and a few sentimental moments all mixed in to one great little movie    stuart little is a mouse    he has finally gotten a new home after being put up for adoption   he now lives with the littles    a nice little   no pun intended   family that lives in their apartment next to central park in new york city    they have a little boy george   played by the adorable jonathan lipnicki   and now they have a new son    at first stuart takes a while but he finally adjusts to being part of the family and even getting along with the

In [37]:
dtm.shape

(2000, 13053)

In [38]:
# we can clearly see that the columns are now reduced to only 13053 (13052 terms plus the 'class' column), and that's a great achievement

### Building the Training and Testing data

In [39]:
df_train = dtm[:1900]
df_test = dtm[1900:]

In [40]:
df_train.shape

(1900, 13053)

In [41]:
df_test.shape

(100, 13053)

In [42]:
# we have 1900 rows in our training data and 100 rows in our test data

In [43]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
X_train = df_train.drop(['class'], axis = 1)
clf.fit(X_train, df_train['class'])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [44]:
# finding accuracy
X_test = df_test.drop(['class'], axis = 1)
clf.score(X_test, df_test['class'])

0.8

In [45]:
# predicting the sentiments
pred_sentiment = clf.predict(df_test.drop('class', axis = 1))

In [46]:
type(df_test)

pandas.core.frame.DataFrame

In [47]:
print(pred_sentiment)

['Pos' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos'
 'Neg' 'Neg' 'Neg' 'Pos' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg'
 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos'
 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Pos'
 'Neg' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg'
 'Neg' 'Pos' 'Pos' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Neg' 'Pos' 'Pos' 'Pos'
 'Pos' 'Pos' 'Neg' 'Pos' 'Pos' 'Neg' 'Pos' 'Neg' 'Neg' 'Pos' 'Neg' 'Neg'
 'Neg' 'Pos' 'Pos' 'Pos']


In [48]:
for i, j in zip(df_test['class'], pred_sentiment):
    print(i, j, sep = '-->')

Pos-->Pos
Pos-->Pos
Pos-->Pos
Pos-->Neg
Pos-->Neg
Neg-->Neg
Neg-->Neg
Neg-->Neg
Neg-->Neg
Neg-->Neg
Neg-->Neg
Pos-->Pos
Neg-->Neg
Neg-->Neg
Neg-->Neg
Neg-->Pos
Pos-->Pos
Neg-->Pos
Pos-->Neg
Pos-->Pos
Neg-->Neg
Neg-->Neg
Neg-->Pos
Pos-->Neg
Pos-->Pos
Pos-->Pos
Neg-->Neg
Pos-->Pos
Pos-->Neg
Neg-->Neg
Neg-->Neg
Pos-->Pos
Pos-->Neg
Pos-->Pos
Pos-->Pos
Pos-->Pos
Pos-->Pos
Pos-->Pos
Pos-->Neg
Neg-->Pos
Pos-->Pos
Neg-->Neg
Pos-->Pos
Neg-->Pos
Pos-->Pos
Neg-->Neg
Pos-->Pos
Neg-->Neg
Neg-->Neg
Pos-->Pos
Neg-->Pos
Neg-->Neg
Pos-->Pos
Neg-->Neg
Pos-->Neg
Neg-->Neg
Pos-->Pos
Neg-->Neg
Pos-->Pos
Pos-->Pos
Neg-->Neg
Neg-->Neg
Pos-->Pos
Neg-->Neg
Neg-->Neg
Pos-->Pos
Neg-->Neg
Pos-->Pos
Neg-->Neg
Pos-->Neg
Neg-->Pos
Neg-->Neg
Pos-->Neg
Pos-->Pos
Pos-->Pos
Neg-->Neg
Neg-->Neg
Neg-->Neg
Neg-->Neg
Pos-->Neg
Neg-->Neg
Pos-->Pos
Pos-->Pos
Pos-->Pos
Pos-->Pos
Pos-->Pos
Neg-->Neg
Pos-->Pos
Pos-->Pos
Neg-->Neg
Pos-->Pos
Neg-->Neg
Neg-->Neg
Pos-->Pos
Pos-->Neg
Neg-->Neg
Neg-->Neg
Pos-->Pos
Neg-->Pos
Pos-->Pos
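
The actual-versus-predicted pairs above can also be tallied instead of read line by line. Below is a minimal sketch using hypothetical toy labels; in the notebook, `actual` would be `df_test['class']` and `predicted` would be `pred_sentiment`.

```python
from collections import Counter

# Hypothetical toy labels standing in for df_test['class'] and pred_sentiment
actual = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]
predicted = ["Pos", "Neg", "Neg", "Pos", "Pos", "Neg"]

# Count each (actual, predicted) combination: a confusion matrix in dictionary form
confusion = Counter(zip(actual, predicted))

# Accuracy is the share of pairs where the prediction matches the actual label
correct = sum(n for (a, p), n in confusion.items() if a == p)
accuracy = correct / len(actual)

for (a, p), n in sorted(confusion.items()):
    print(a, p, n, sep='-->')
print(accuracy)
```

Breaking the errors down by class shows whether the model is wrong more often on positive or on negative reviews, which a single accuracy figure hides.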


In [76]:
text = "Superb movie. Very good story and wonderful presentation"
a = pd.Series(text)
countvec1 = CountVectorizer(min_df=0.1, tokenizer = tokenise, stop_words = stopwords.words('english'))
dtm_pred = pd.DataFrame(countvec1.fit_transform(a).toarray(), columns= countvec1.get_feature_names(), index = None)

In [77]:
dtm_pred

Unnamed: 0,good,movie,present,story,superb,wonderful
0,1,1,1,1,1,1


In [78]:
pred_new_sentiment = clf.predict(dtm_pred)

ValueError: shapes (1,6) and (13052,2) not aligned: 6 (dim 1) != 13052 (dim 0)
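
The error arises because `countvec1` was fitted on the single new sentence, so `dtm_pred` has only 6 columns, while `clf` was trained on the 13052 term columns produced by `countvec`. One way to avoid the mismatch is to reuse the vectorizer that was fitted on the training data and call only `transform` on the new text. The sketch below shows the pattern on a hypothetical toy corpus with default `CountVectorizer` settings; the names `vec`, `model`, `train_texts`, and `train_labels` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for input_data['text'] and input_data['class']
train_texts = [
    "superb movie with a wonderful story",
    "good story and good presentation",
    "boring movie with a terrible story",
    "terrible acting and a dull plot",
]
train_labels = ["Pos", "Pos", "Neg", "Neg"]

# Fit the vocabulary once, on the training texts only
vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
model = MultinomialNB().fit(X_train, train_labels)

# For new text, call transform (not fit_transform) on the SAME vectorizer,
# so the columns line up with the ones the model was trained on
new_text = ["Superb movie. Very good story and wonderful presentation"]
X_new = vec.transform(new_text)
prediction = model.predict(X_new)

print(X_train.shape[1], X_new.shape[1])
print(prediction)
```

Words in the new text that are not in the fitted vocabulary (here "very") are simply ignored by `transform`. In the notebook, the equivalent step would presumably be `clf.predict(countvec.transform(a))`.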