# NLP Practice

## Steps:
<b> 1. Import data from a file. </b><br/>
<b> 2. Select the column that contains the comments/text data from the dataframe. </b> <br/>
<b> 3. Clean the text by performing the following sub-steps: </b> <br/>
        &nbsp; 3.0 Tokenization: split sentences into words <br/>
        &nbsp; 3.1 Remove punctuation and symbols <br/>
        &nbsp; 3.2 Normalize case (convert to lower or upper case) and remove stop words<br/>
        &nbsp; 3.3 Stemming: words are reduced to a root form by dropping characters, usually a suffix, to remove inflection. <br/>
        &nbsp; 3.4 Lemmatization: another approach to removing inflection, which determines the part of speech and looks the word up in a detailed lexical database of the language. <br/>
        &nbsp; 3.5 Part-of-Speech (POS) tagging: you can decide whether to remove words based on their tags (verbs, adjectives, pronouns, etc.), so this step can act as a filter for unwanted words. <br/>
<b> 4. Create a bag-of-words (BOW) model with N-grams (unigrams, bigrams, trigrams, etc.): </b><br/>
        &nbsp; 4.1 Use a count vectorizer (how many times each word occurs in a document, i.e. a row)<br/>
        &nbsp; 4.2 Use a TF-IDF vectorizer: term frequency (TF) multiplied by inverse document frequency (IDF), where IDF = log(total number of documents / number of documents containing term t)<br/>
<b>NOTE:</b> <br/>
        &nbsp; <i>a.</i> If you have an output variable (labels), use an appropriate supervised machine learning model <br/>
            &nbsp; &nbsp; (many NLP problems of this kind are classification problems, known as text classification) <br/>
        &nbsp; <i>b.</i> If you do not have an output variable (for example, you want to do sentiment analysis on unlabeled text), then:<br/>
            &nbsp; &nbsp; <i>b.1)</i> Use a pre-trained model (for example, from the TextBlob library) <br/>
            &nbsp; &nbsp; <i>b.2)</i> Use a clustering model (for example, K-means) <br/>
<b> 5. Split the data into training and test sets </b><br/>
<b> 6. Train a machine learning model on the training set and evaluate it on the test set </b><br/>
<b> THE END </b>
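
Step 4.2 is easier to follow with a small worked example. The sketch below computes a plain TF-IDF score, assuming the textbook definition TF × log(N / DF). The three sample documents and the function name `tf_idf` are made up for illustration. Library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization by default, so their numbers will differ.

```python
import math

# Hypothetical, already-cleaned documents (not taken from the dataset)
docs = [
    "food good service good",
    "food bad",
    "service slow",
]
tokenized = [doc.split() for doc in docs]

def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in all_docs if term in tokens)
    if df == 0:
        return 0.0  # the term never occurs, so it carries no weight
    idf = math.log(len(all_docs) / df)
    return tf * idf

# "good" is frequent in document 0 and rare elsewhere, so it scores higher
# than "food", which also appears in a second document.
for term in ("good", "food"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

A term that appears in every document gets IDF = log(1) = 0 under this definition, which is why very common words contribute little.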

In [1]:
# The two libraries used most frequently for NLP in Python:
# 1) NLTK
# 2) spaCy

In [20]:
# Import all libraries
import pandas as pd
import nltk
import re # Regular expression
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [3]:
# Read data from file
dataset = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')

In [4]:
dataset.head()

,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [21]:
# Download the NLTK stop word list and the WordNet data (needed by the lemmatizer) on first run
# nltk.download('stopwords')
# nltk.download('wordnet')

# Create a lemmatizer object
lem = WordNetLemmatizer()

In [30]:
# Build a corpus
corpus = []  # stores the cleaned reviews

# Build the stop word set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(len(dataset)):
    # Example input: "Hello, my name is Vishwa Patel troubled @ $ 123."
    str1 = str(dataset["Review"].iloc[i])

    # Step 1: Replace punctuation and symbols with spaces
    str1 = re.sub('[^a-zA-Z0-9]', ' ', str1)

    # Collapse the extra spaces left behind
    str1 = " ".join(str1.split())

    # Step 2: Convert to lower case (normalize the text)
    str1 = str1.lower()

    # Step 3: Remove stop words
    words = [word for word in str1.split() if word not in stop_words]

    # Step 4: Lemmatize each word
    str1 = " ".join(lem.lemmatize(word) for word in words)

    # Step 5: Append the cleaned text to the corpus
    corpus.append(str1)
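
To see what each cleaning step does, the sketch below runs the same regex, whitespace, lower-casing and stop-word steps on the sample sentence from the comment in the loop. It uses only the standard library: the tiny `STOP_WORDS` set is a hypothetical stand-in for NLTK's English list, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical stand-in for NLTK's much longer English stop word list
STOP_WORDS = {"my", "is", "the", "a", "and"}

def clean(text):
    """Apply the cleaning steps from the loop, minus lemmatization."""
    text = re.sub("[^a-zA-Z0-9]", " ", text)  # punctuation/symbols -> spaces
    text = " ".join(text.split())              # collapse repeated spaces
    text = text.lower()                        # normalize case
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean("Hello, my name is Vishwa Patel troubled @ $ 123."))
# hello name vishwa patel troubled 123
```

One thing to keep in mind: NLTK's English stop word list includes negations such as "not", so a review like "Crust is not good." loses its negation in the real pipeline. For sentiment tasks it may be worth keeping such words.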


In [32]:
# Create a bag-of-words model from the corpus
# ngram_range=(1, 1) keeps unigrams only; max_features=None keeps the full vocabulary
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 1), max_features=None)
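
`ngram_range=(1, 1)` counts single words only. As a rough illustration of what a wider range would count, the following standard-library sketch builds unigram and bigram counts for one cleaned review. The helper `ngram_counts` is hypothetical and only mimics the counting idea, not `CountVectorizer` itself.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=2):
    """Count all word n-grams with n_min <= n <= n_max in a cleaned string."""
    words = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

counts = ngram_counts("good food good service")
print(counts["good"])          # unigram count
print(counts["good service"])  # bigram count
```

Bigrams keep some word order ("good service" versus "service good"), at the cost of a much larger vocabulary.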

In [39]:
# Fit and transform data 
X = cv.fit_transform(corpus).toarray()

In [46]:
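# To view the count-vectorized matrix with its column names, use the following code.
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use it instead of get_feature_names_out().
df = pd.DataFrame(X, columns=cv.get_feature_names_out())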

In [48]:
# Target variable: the "Liked" column
y = dataset.iloc[:, 1].values
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
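
`test_size=0.2` holds out 20% of the rows for testing, and `random_state=0` fixes the shuffle so the split is reproducible. The standard-library sketch below illustrates that idea only. The function `simple_split` is hypothetical, and it will not select the same rows as `train_test_split`.

```python
import random

def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10))
print(len(train), len(test))
```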

In [49]:
# Create a machine learning model (Classification)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()

In [50]:
# Fit the model to the training data
classifier.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [51]:
# Predict the results using X_test
y_pred = classifier.predict(X_test)

In [52]:
# Measure the performance of the model
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[48 49]
 [14 89]]
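
The confusion matrix can be condensed into a few summary metrics. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, ordered 0 then 1) and plugs in the numbers printed above.

```python
# Values copied from the printed confusion matrix
tn, fp = 48, 49  # true label 0: predicted 0, predicted 1
fn, tp = 14, 89  # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # share of predicted-positive reviews that are positive
recall = tp / (tp + fn)     # share of truly positive reviews that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On these 200 test reviews the model is right 68.5% of the time. Most of the errors are false positives: 49 of the 97 negative reviews are predicted as positive.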
