## Natural Language Processing (NLP)

**Natural language processing (NLP)** is a branch of artificial intelligence that aims to teach machines to understand human language.
 - NLP powers many **real-world applications, such as Google Translate, Siri, and chatbots.**
 - Before tackling an NLP task, we follow **a standard procedure to build a vocabulary of terms from the text dataset.**
 - This notebook is for you if you want to understand how a typical NLP problem is solved.
 - I will walk you through the whole process in Python, using sentiment classification as the running example.

**The steps we will follow, shown here for sentiment classification, are:**

- Finding a dataset for sentiment classification
- Preparing the dataset with tokenization, stopword removal, and stemming
- Vectorizing the text
- Training a model for sentiment classification

### 1) Finding a Dataset

- Locating a text dataset is the first step in any NLP problem. For sentiment classification, we need a dataset of text in which people express opinions about a product or service. Ideally, the dataset is labelled.
- For sentiment classification, I found a well-suited dataset of movie reviews on Kaggle.

- Now that we have a dataset, let's import the required Python libraries and load the data.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # silence library FutureWarnings

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import nltk

nltk.download('stopwords')  # fetch the NLTK stopword lists if they are not present yet

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
df = pd.read_csv("C:/Users/asus/OneDrive/Desktop/ML_Datasets/project/More_Projects/IMDB_Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)
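
Before cleaning the text, it can help to check that the labels are reasonably balanced and that no reviews are missing. The sketch below runs on a tiny made-up DataFrame; the rows are hypothetical stand-ins for the Kaggle file, and only the column names `review` and `sentiment` are taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB CSV; only the column names match the real file.
toy_df = pd.DataFrame({
    "review": ["Loved it", "Terrible plot", "Great acting", None],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Count rows per label to see whether the classes are balanced.
label_counts = toy_df["sentiment"].value_counts()
print(label_counts)

# Count missing values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Drop rows that have no review text before further processing.
toy_df = toy_df.dropna(subset=["review"]).reset_index(drop=True)
```

The same three calls (`value_counts`, `isna().sum()`, `dropna`) can be applied to `df` directly.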

### 2) Data Preparation, Tokenization, Stopwords Removal and Stemming

- The raw text needs to be prepared before it can be used for modelling. Here we will:
  - remove links and special characters from the review column.
  - tokenize the reviews and remove the stopwords.
  - stem the remaining words.

In [4]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text inside square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags such as <br />
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    # Tokenize on whitespace, drop stopwords, then stem each remaining token.
    words = [word for word in text.split() if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

df["review"] = df["review"].apply(clean)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asus\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,review,sentiment
0,one review mention watch oz episod youll hook...,positive
1,wonder littl product film techniqu unassum old...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic there famili littl boy jake think there ...,negative
4,petter mattei love time money visual stun film...,positive
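
To see what the individual cleaning stages do, here is a small standard-library sketch that records the text after each stage. It is an illustration rather than the notebook's `clean` function: `TOY_STOPWORDS` and `clean_stages` are hypothetical names, the stopword list is a tiny made-up subset, HTML tags are replaced with a space so neighbouring words do not merge, and stemming is left out because it needs NLTK.

```python
import re
import string

# Hypothetical mini stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"a", "the", "was", "this", "it", "and", "at"}

def clean_stages(text):
    """Return the text after each cleaning stage so every step is visible."""
    stages = {}
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop links
    text = re.sub(r"<.*?>", " ", text)                  # replace HTML tags with a space
    stages["no_links_or_tags"] = text
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    stages["no_punctuation"] = text
    # Tokenize on whitespace and drop stopwords.
    stages["tokens"] = [t for t in text.split() if t not in TOY_STOPWORDS]
    return stages

sample = "This was GREAT!<br />See https://example.com and enjoy it."
for name, value in clean_stages(sample).items():
    print(name, "->", value)
```

The final `tokens` entry for this sample is `['great', 'see', 'enjoy']`.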


### 3) Text Vectorization

The next step is text vectorization.
- It means transforming the text tokens into numerical vectors.
- Here I first vectorize the feature column (the review column) and then split the data into training and test sets:

In [5]:
x = np.array(df["review"])
y = np.array(df["sentiment"])
cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
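
The counting idea behind a bag-of-words vectorizer can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the hypothetical `bag_of_words` helper splits on whitespace only and returns dense lists, whereas `CountVectorizer` applies its own token pattern and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    # Vocabulary: every distinct whitespace-separated token, in sorted order.
    vocab = sorted({token for doc in docs for token in doc.split()})
    index = {token: i for i, token in enumerate(vocab)}
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocab)
        for token, n in counts.items():
            row[index[token]] = n
        vectors.append(row)
    return vocab, vectors

docs = ["wonder film wonder act", "bad film"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['act', 'bad', 'film', 'wonder']
print(vectors)  # [[1, 0, 1, 2], [0, 1, 1, 0]]
```

Each row has one entry per vocabulary term, so every review becomes a vector of the same length regardless of how long the review is.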

### 4) Text Classification

The final step in an NLP workflow is usually to classify or cluster the texts.
- Since we are working on sentiment classification, we will now train a text classification model.
- Here's how to train a classifier for sentiment classification:

In [6]:
from sklearn.linear_model import PassiveAggressiveClassifier
model = PassiveAggressiveClassifier()
model.fit(X_train,y_train)
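
The held-out test split can be used to estimate how well a trained model generalizes. The sketch below shows the pattern on a handful of made-up, already-cleaned reviews; the data and the names `toy_cv` and `toy_model` are hypothetical, and with so few samples the score itself is not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical, already-cleaned toy reviews standing in for the IMDB data.
texts = [
    "wonder film great act", "love stori beauti music",
    "great fun enjoy lot", "brilliant cast love film",
    "bad plot terribl act", "bore film wast time",
    "aw script bad direct", "terribl end wast money",
]
labels = ["positive"] * 4 + ["negative"] * 4

toy_cv = CountVectorizer()
X_toy = toy_cv.fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, labels, test_size=0.25, random_state=42, stratify=labels
)

toy_model = PassiveAggressiveClassifier(random_state=42)
toy_model.fit(X_tr, y_tr)

# Compare predictions on the unseen split with the true labels.
predictions = toy_model.predict(X_te)
accuracy = accuracy_score(y_te, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

For the notebook's own objects, the equivalent call would be `accuracy_score(y_test, model.predict(X_test))`.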

The dataset we used to train a sentiment classification model contains movie reviews. 
So let’s test the model by giving a movie review as an input:

In [7]:
user = input("Enter a Text: ")
# Apply the same cleaning used on the training reviews before vectorizing.
data = cv.transform([clean(user)])
output = model.predict(data)
print(output)

Enter a Text: This was nither a bad movie nor a good movie
['negative']
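
One way to make sure new text goes through the same preprocessing as the training reviews is to bundle the steps in a scikit-learn pipeline. The sketch below is a minimal example under stated assumptions: `simple_clean` is a hypothetical, simplified cleaner (lowercasing and punctuation removal only), and the training sentences are made up.

```python
import re
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

def simple_clean(text):
    """Hypothetical, simplified cleaner: lowercase and strip punctuation only."""
    text = str(text).lower()
    return re.sub(r"[%s]" % re.escape(string.punctuation), "", text)

# Made-up raw training sentences and labels.
train_texts = ["Great film, loved it!", "Wonderful acting.",
               "Terrible plot...", "Awful, boring film."]
train_labels = ["positive", "positive", "negative", "negative"]

# The vectorizer calls simple_clean on every document, at fit time and at predict time.
sentiment_pipeline = make_pipeline(
    CountVectorizer(preprocessor=simple_clean),
    PassiveAggressiveClassifier(random_state=42),
)
sentiment_pipeline.fit(train_texts, train_labels)

# Raw text can be passed directly; cleaning and vectorizing happen inside the pipeline.
prediction = sentiment_pipeline.predict(["Loved the acting!"])
print(prediction)
```

Because the cleaner is attached to the vectorizer, there is no separate step to forget when predicting on new input.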


**This is how you can approach sentiment classification, and many similar NLP problems, using Python.**