# NLP WITH TF-IDF

<img src="https://miro.medium.com/max/714/1*wgKxsWlT3ifsZmNy-DihFQ.png">

In this notebook, we are going to use TfidfVectorizer, which turns each review into a vector of word weights. We'll also look at how to use a named entity recognizer and a word cloud.

### LET'S GO!

<img src="https://ichi.pro/assets/images/max/724/0*C-cPP9D2MIyeexAT.gif">

# DATA SET

### Our dataset consists of reviews and their labels. We are trying to find out whether a review is negative or positive.

Columns: 
> Review : how do films like mouse hunt get into theatres ? isn't there a law or something ? 

> Label  :  negative(neg)

Classes : 
* negative(neg)
* positive(pos)

# LIBRARIES

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os
import re
import spacy
from spacy import displacy
import nltk
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
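from sklearn.metrics import confusion_matrix,classification_report
# plot_confusion_matrix, plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2.
# Where they are missing, alias them to the Display.from_estimator methods, which take the same (estimator, X, y) arguments.
try:
    from sklearn.metrics import plot_confusion_matrix,plot_precision_recall_curve,plot_roc_curve
except ImportError:
    from sklearn.metrics import ConfusionMatrixDisplay,PrecisionRecallDisplay,RocCurveDisplay
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator
    plot_precision_recall_curve = PrecisionRecallDisplay.from_estimator
    plot_roc_curve = RocCurveDisplay.from_estimator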

### IMPORT DATA

In [None]:
text = pd.read_csv('../input/positiveornegative/text.csv')
text.drop('Unnamed: 0',axis=1,inplace=True)
text.head()

#### CHECKING TYPES

In [None]:
text.dtypes

# TEXT CLEANING
#### 1. Punctuation is often unnecessary because it doesn't add value or meaning to the NLP model.
> Examples of punctuation: ! " # % & ' ( ) * + , - . / : ; < = > ? etc.
#### 2. Tokenizing is the process of splitting a string into a list of words. We will use regular expressions (regex) to do the splitting. A regex describes a search pattern.
> Example of tokenizing: [ 'I love Programming' ] -> [ 'I' , 'love' , 'Programming' ]
#### 3. Stop words are very common words that won't help in identifying a text as NEGATIVE or POSITIVE.
> Examples of stop words: 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"

In [None]:
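nlp = spacy.load('en_core_web_sm')
# Needs the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing
stopword = set(nltk.corpus.stopwords.words('english'))       # a set gives fast membership tests
def text_cleaning(text):
    text = re.sub(r'[^\w\s]', '', str(text))                 # Punctuation: keep only word characters and whitespace
    text = re.split(r"\W+", text)                            # Tokenizing (raw string avoids an invalid-escape warning)
    # Stop words: the NLTK list is lowercase, so compare in lowercase ('I' must match 'i')
    text = [word for word in text if word.lower() not in stopword]
    text = ' '.join(text)                                    # Join the remaining tokens back into one string
    return text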
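
To make the three steps easier to follow, here is a small walk-through on a single sentence. It is only a sketch: `STOP_WORDS` is a hypothetical mini list invented for this example (so it runs without the NLTK data), and the sentence is made up.

```python
import re

# Hypothetical mini stop-word list, used only so the example runs without NLTK data
STOP_WORDS = {"i", "me", "my", "we", "the", "this", "is", "a", "it"}

sentence = "I love this movie, it is a masterpiece!"

# 1. Punctuation: drop every character that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", sentence)

# 2. Tokenizing: split the string into a list of words
tokens = no_punct.split()

# 3. Stop words: compare in lowercase so that "I" matches "i"
kept = [word for word in tokens if word.lower() not in STOP_WORDS]

print(no_punct)
print(tokens)
print(kept)
```

Only the words that carry sentiment or content survive the last step, which is what we want to feed into the vectorizer later.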


>  APPLYING FUNCTION TO OUR TEXT DATA

In [None]:
text['review'] = text['review'].apply(text_cleaning)
text

> LET'S DEFINE A FUNCTION FOR MISSING VALUES

In [None]:
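def missing_values(dataframe):
    drop_list = []
    # Look the review column up by name so the check does not depend on column order
    for i, review in dataframe['review'].items():
        # strip() catches both empty strings and whitespace-only strings
        if isinstance(review, str) and not review.strip():
            drop_list.append(i)

    # Drop the blank rows in place and return the same dataframe
    dataframe.drop(drop_list,axis=0,inplace=True)
    return dataframe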

In [None]:
text = missing_values(text)

In [None]:
#length of our data set 
len(text)

> Now our dataset is ready for the model, but before that let's explore some visualizations. What can we do with words?

<img src="https://thumbs.gfycat.com/CrazyIllinformedAstrangiacoral-size_restricted.gif">

# Named Entity Recognition 

Named entity recognition (NER) helps you identify the key elements in a text, like names of people, places, brands, monetary values, and more. Extracting the main entities in a text helps sort unstructured data and detect important information, which is crucial if you have to deal with large datasets.

In [None]:
text_instance = nlp(u'Mark Zuckerberg is one of the founders of Facebook, a company from the United States')
for sentence in text_instance.sents:
    docx = nlp(sentence.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
    else:
        print(docx.text)

In [None]:
text_instance = nlp(u"Apple's first company logo featured a drawing of the father of physics, Sir Isaac Newton. To raise capital for Apple, co-founder Steve Wozniak had to sell his scientific calculator.")
for sentence in text_instance.sents:
    docx = nlp(sentence.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
    else:
        print(docx.text)

In [None]:
text_instance = nlp(u"There are 32 teams in the NFL, each vying for the Super Bowl win at the end of the season. The first game to be televised was between the Philadelphia Eagles and the Brooklyn Dodgers in 1939. There were approximately 500 television sets in New York able to play the game.")
for sentence in text_instance.sents:
    docx = nlp(sentence.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
    else:
        print(docx.text)

# VISUALIZATION

> We have the same number of reviews for both labels

In [None]:
sns.countplot(x="label",data=text)

### Word Cloud Usage

In [None]:
cloud_data = ["NLP","CNN","ANN","RNN","Deep Learning","Machine Learning","OpenCV","Tokenizing","StopWords","Punctuations","TFIDF","CountVector","NLTK","Pipeline","Performance Metrics"]
wordcloud = WordCloud(width = 400, height = 400,
                background_color ='white',
                stopwords = stopword,
                min_font_size = 10).generate(' '.join(cloud_data))
  
# plot the WordCloud image                       
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
  
plt.show()

Just replace cloud_data with your own words to make your own word cloud 😊✌🏻

### TEST AND TRAIN

In [None]:
input_data = text['review']
output_data = text['label']

train_data, test_data, train_output, test_output = train_test_split(input_data, output_data, test_size=0.2, random_state=101)

#### TfidfVectorizer

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing (NLP).



<img src= "https://miro.medium.com/max/3604/1*qQgnyPLDIkUmeZKN2_ZWbQ.png">
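
To see the multiplication in action, here is a minimal pure-Python sketch on a toy corpus. The three documents and the helper names `idf` and `tfidf_vector` are invented for this example. The formula used is the smoothed variant, idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each vector to unit length, which is what scikit-learn documents as the TfidfVectorizer default; the real vectorizer also lowercases and tokenizes differently, so treat this as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "good movie",
    "bad movie",
    "good good plot",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word occurs
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: rare words get a larger value
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_vector(tokens):
    # Term frequency (raw count) times idf, then scaled to unit Euclidean length
    counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {w: round(v, 3) for w, v in tfidf_vector(tokens).items()})
```

In the second document, "bad" appears in only one document while "movie" appears in two, so "bad" ends up with the larger weight even though both words occur once.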

In [None]:
tfidf = TfidfVectorizer()
tfidf_text_train = tfidf.fit_transform(train_data.values.astype('U'))
tfidf_text_test = tfidf.transform(test_data.values.astype('U'))

> Let's see how it works

In [None]:
text_example = 'This is very good example'
print(tfidf.transform([text_example]))

# MODEL

In [None]:
linear_svc = LinearSVC()
linear_svc.fit(tfidf_text_train,train_output)
predictions = linear_svc.predict(tfidf_text_test)

# PIPELINE

#### Instead of running TF-IDF and the SVM as separate steps, we can use a pipeline that chains both into a single object.

In [None]:
positive_or_negative = Pipeline([('tfidf', TfidfVectorizer()),
                     ('linear_svc', LinearSVC()),])

positive_or_negative.fit(train_data,train_output)

In [None]:
pred = positive_or_negative.predict(test_data)
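
As a rough picture of what happens inside such a chain, here is a toy version in plain Python. Everything in it is hypothetical and invented for this sketch (`MiniPipeline`, `WordSetTransformer`, `WordVoteClassifier` and the four training texts); it is not how scikit-learn implements Pipeline, it only mimics the order of the calls: fit the transformer and the model during training, then reuse the fitted transformer at prediction time.

```python
from collections import Counter

class WordSetTransformer:
    """Toy transformer: represents each text as the set of known words it contains."""
    def fit_transform(self, texts):
        self.vocabulary_ = {w for t in texts for w in t.lower().split()}
        return self.transform(texts)

    def transform(self, texts):
        return [set(t.lower().split()) & self.vocabulary_ for t in texts]

class WordVoteClassifier:
    """Toy classifier: each word votes for the labels it was seen with in training."""
    def fit(self, features, labels):
        self.votes_ = {}
        for words, label in zip(features, labels):
            for w in words:
                self.votes_.setdefault(w, Counter())[label] += 1
        self.default_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, features):
        predictions = []
        for words in features:
            total = Counter()
            for w in words:
                total.update(self.votes_.get(w, {}))
            predictions.append(total.most_common(1)[0][0] if total else self.default_)
        return predictions

class MiniPipeline:
    """Hypothetical stand-in that mimics the fit/predict flow of a two-step pipeline."""
    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, texts, labels):
        # Training: learn the vocabulary, transform the texts, then fit the model
        features = self.transformer.fit_transform(texts)
        self.estimator.fit(features, labels)
        return self

    def predict(self, texts):
        # Prediction: reuse the fitted vocabulary (transform only, no refit)
        return self.estimator.predict(self.transformer.transform(texts))

model = MiniPipeline(WordSetTransformer(), WordVoteClassifier())
model.fit(["great film", "awful film", "great acting", "awful plot"],
          ["pos", "neg", "pos", "neg"])
print(model.predict(["great plot twist", "awful acting"]))
```

The point is the calling pattern: raw text goes in, and the pipeline takes care of applying the same transformation in training and in prediction.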

# PERFORMANCE METRICS

In [None]:
plot_confusion_matrix(linear_svc,tfidf_text_test,test_output)

In [None]:
print(classification_report(test_output,pred))
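
The precision, recall and F1 columns of the report can be recomputed by hand from the counts of true positives, false positives and false negatives. The sketch below does this for a small set of labels invented for illustration; `scores_for` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical true and predicted labels, invented for illustration
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "neg", "neg", "pos", "pos", "pos"]

def scores_for(label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label in ("neg", "pos"):
    precision, recall, f1 = scores_for(label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Precision answers "of everything predicted as this label, how much was right?", while recall answers "of everything that really has this label, how much did we find?".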

In [None]:
plot_precision_recall_curve(linear_svc,tfidf_text_test,test_output)

In [None]:
plot_roc_curve(linear_svc,tfidf_text_test,test_output)

> It seems the model works well

# SINGLE TEST

In [None]:
single_test_text = "very bad movie. Don't watch it"
positive_or_negative.predict([single_test_text])

In [None]:
single_test_text = "It is very delicious. I strongly recommend that"
positive_or_negative.predict([single_test_text])

### We are good so far. 👌🏻

<img src="https://pa1.narvii.com/6292/ce507693b01eb658851d77a7b601902509c63b03_hq.gif">

### I hope you enjoyed it 😊