# Natural Language Processing (NLP) for Text Classification Project

# Introduction

Welcome to the documentation and code for the NLP-based Text Classification project. This project focuses on leveraging Natural Language Processing techniques to automatically categorize text data into predefined categories or labels. Text classification plays a crucial role in various applications, including sentiment analysis, spam detection, and content categorization.

# Project Goals

The primary goals of this project are as follows: 
    
  1. Develop a robust text classification model capable of accurately categorizing input text into predefined classes.
  2. Explore and implement state-of-the-art NLP techniques to enhance the model's understanding of textual data.
  3. Provide a scalable and efficient solution for automated text classification tasks.

# Import libraries

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


Here, 'numpy', 'pandas', 'matplotlib', and 'seaborn' are libraries used for data analysis and visualization, whereas 'np', 'pd', 'plt', and 'sns' are their conventional aliases, used for brevity.

# Uploaded dataset

The 'pd.read_csv()' function loads a CSV dataset into a pandas DataFrame.

In [8]:
# raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\priya\internship\IMDB Dataset.csv")

Now, we select the first 10,000 rows of the DataFrame 'df' and assign them to a new variable named 'data'. From here on, we work with this subset for further analysis and processing.

In [9]:
data= df.iloc[:10000]
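
If only the first 10,000 rows are needed, an alternative is to let pandas stop reading early. The following is a minimal sketch on a small in-memory CSV; the `csv_text` contents are made up for illustration, while `nrows` is a standard `pd.read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the IMDB file.
csv_text = (
    "review,sentiment\n"
    "good film,positive\n"
    "bad film,negative\n"
    "fine film,positive\n"
)

# nrows limits how many data rows are parsed (the header row is not counted).
subset = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(subset.shape)
```

With the real file, the same idea would be `pd.read_csv(path, nrows=10000)`, which avoids loading the rows that are never used.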

After slicing the DataFrame and assigning it to 'data', we use the '.head()' method to display the first five rows.

In [10]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


We access the value in the 'review' column for the second row (index 1) of the DataFrame 'data'.

In [11]:
data['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

We use the '.value_counts()' method on the 'sentiment' column to show the distribution of sentiments in our dataset.

In [12]:
data['sentiment'].value_counts()

sentiment
positive    5028
negative    4972
Name: count, dtype: int64

'.isnull().sum()' checks for null (missing) values in each column of the DataFrame 'data' and sums the count per column. If the count is 0 for every column, there are no missing values.

In [7]:
data.isnull().sum()

review       0
sentiment    0
dtype: int64

Now, we count the duplicated rows in 'data'. A count of 0 means there are no duplicate rows; a non-zero count is the number of rows that are identical to an earlier row in the DataFrame.

In [13]:
data.duplicated().sum()

17

Here, the '.drop_duplicates()' method removes the duplicated rows from 'data'; setting 'inplace=True' modifies the DataFrame in place. pandas emits a SettingWithCopyWarning because 'data' was created as a slice of 'df'.

In [14]:
data.drop_duplicates(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.drop_duplicates(inplace=True)
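
One common way to avoid the warning shown above is to take an explicit copy when creating the subset. The sketch below uses a small made-up frame (`df_demo` and `data_demo` are illustrative names, not part of the notebook) and also shows how duplicates are counted and dropped.

```python
import pandas as pd

# Toy frame with one duplicated row (hypothetical data).
df_demo = pd.DataFrame({
    "review": ["good film", "bad film", "good film", "fine film"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# .copy() makes the subset an independent DataFrame, so later
# modifications do not trigger the copy-of-a-slice warning.
data_demo = df_demo.iloc[:3].copy()

n_dupes = int(data_demo.duplicated().sum())  # rows identical to an earlier row
data_demo = data_demo.drop_duplicates()      # returns a new, de-duplicated frame
print(n_dupes, len(data_demo))
```

In the notebook, the equivalent change would be `data = df.iloc[:10000].copy()`.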


The duplicate rows are now removed, so the count of duplicated rows in 'data' is 0.

In [15]:
data.duplicated().sum()

0

# Basic preprocessing

1. Remove tags
2. Lowercase
3. Remove stopwords

Here, we define a function named 'remove_tags' that uses the 're' module (regular expressions) to remove HTML tags from a given 'raw_text'.

In [16]:
import re
def remove_tags(raw_text):
    cleaned_text= re.sub(re.compile('<.*?>'),'', raw_text)
    return cleaned_text
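
As a quick check of the pattern, the sketch below applies the same regular expression to a fragment of the second review. Compiling the pattern once as `TAG_RE` is a small variation on the cell above, not the notebook's exact code.

```python
import re

# Non-greedy match: each <...> tag is matched separately.
TAG_RE = re.compile(r"<.*?>")

def remove_tags(raw_text):
    # Replace every HTML tag with an empty string.
    return TAG_RE.sub("", raw_text)

sample = "A wonderful little production. <br /><br />The filming technique is very unassuming"
cleaned = remove_tags(sample)
print(cleaned)
```

The non-greedy `.*?` matters here: a greedy `<.*>` would remove everything between the first `<` and the last `>` in the review.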

Now, we apply the 'remove_tags' function to the 'review' column of the DataFrame 'data' and overwrite the column with the cleaned text (without HTML tags).

In [17]:
data['review']=data['review'].apply(remove_tags)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['review']=data['review'].apply(remove_tags)


We check the DataFrame 'data' after removing the tags.

In [18]:
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
9995,"Fun, entertaining movie about WWII German spy ...",positive
9996,Give me a break. How can anyone say that this ...,negative
9997,This movie is a bad movie. But after watching ...,negative
9998,This is a movie that was probably made to ente...,negative


Here, we use the 'apply()' method with a lambda function to convert all text in the 'review' column of 'data' to lowercase.

In [19]:
data['review']= data['review'].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['review']= data['review'].apply(lambda x:x.lower())
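
pandas also offers a vectorized string accessor that does the same job without a lambda. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical sample reviews.
reviews = pd.Series(["A Wonderful Little Production.", "Give ME a break."])

# .str.lower() lowercases every element of the Series.
lowered = reviews.str.lower()
print(lowered.tolist())
```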


Now, we check 'data' to confirm that the lowercasing was applied.

In [20]:
data

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
9995,"fun, entertaining movie about wwii german spy ...",positive
9996,give me a break. how can anyone say that this ...,negative
9997,this movie is a bad movie. but after watching ...,negative
9998,this is a movie that was probably made to ente...,negative


NLTK stands for Natural Language Toolkit. This code imports the 'nltk' library and downloads the stopwords dataset, which is commonly used in Natural Language Processing (NLP) tasks to filter out common words that usually carry little information. In the run shown here the download failed with a network error ('getaddrinfo failed') and returned False, so the next cell relies on a copy of the stopwords corpus that is already available locally.

In [21]:
import nltk
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

We create a lambda function that splits each review into words, filters out the stop words, and joins the remaining words back into a string. We apply it to the 'review' column of the DataFrame, so the reviews no longer contain English stop words.

In [22]:
from nltk.corpus import stopwords
sw_list=stopwords.words('english')
data['review']=data['review'].apply(lambda x:[item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['review']=data['review'].apply(lambda x:[item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))
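
The filtering step can be sketched in plain Python. The `stop_words` set below is a tiny hypothetical stand-in for NLTK's English list; using a set rather than a list makes each membership test constant-time, which may help on 10,000 reviews.

```python
# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
stop_words = {"of", "the", "has", "that", "a", "is", "this"}

def remove_stopwords(text):
    # Keep the words that are not stop words and re-join them with single spaces.
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = remove_stopwords("one of the other reviewers has mentioned that")
print(filtered)
```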


We check the DataFrame 'data' again.

In [23]:
data


Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. filming technique...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive
...,...,...
9995,"fun, entertaining movie wwii german spy (julie...",positive
9996,"give break. anyone say ""good hockey movie""? kn...",negative
9997,movie bad movie. watching endless series bad h...,negative
9998,"movie probably made entertain middle school, e...",negative


Here, we separate the 'review' column into 'x' and the 'sentiment' column into 'y'.

In [25]:
x= data.iloc[:,0:1]
y=data['sentiment']

Display 'x'.

In [26]:
x

Unnamed: 0,review
0,one reviewers mentioned watching 1 oz episode ...
1,wonderful little production. filming technique...
2,thought wonderful way spend time hot summer we...
3,basically there's family little boy (jake) thi...
4,"petter mattei's ""love time money"" visually stu..."
...,...
9995,"fun, entertaining movie wwii german spy (julie..."
9996,"give break. anyone say ""good hockey movie""? kn..."
9997,movie bad movie. watching endless series bad h...
9998,"movie probably made entertain middle school, e..."


Display 'y'.

In [21]:
y

0       positive
1       positive
2       positive
3       negative
4       positive
          ...   
9995    positive
9996    negative
9997    negative
9998    negative
9999    positive
Name: sentiment, Length: 9983, dtype: object

# Feature Engineering

Here, we use 'LabelEncoder' to transform the categorical labels in 'y' into numerical values. The 'fit_transform' method fits the encoder to the unique labels in 'y' and transforms them into their numerical representation.

In [22]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
y= encoder.fit_transform(y)

After displaying 'y', we can see that each label is now a number: 'negative' is encoded as 0 and 'positive' as 1.

In [23]:
y

array([1, 1, 1, ..., 0, 0, 1])
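
To see which number stands for which label, the fitted encoder can be inspected. A minimal sketch on three made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "positive"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels.
print(encoder.inverse_transform(encoded).tolist())
```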

The code uses the 'train_test_split' function from scikit-learn to split the data into training and testing sets. The 'test_size' parameter specifies the proportion of the dataset to include in the test split, and 'random_state' ensures reproducibility by fixing the random seed.

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=1)
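
The effect of 'test_size' and 'random_state' can be checked on a plain list of row indices. This is a sketch, with `items` standing in for the 9,983 rows left after removing duplicates.

```python
from sklearn.model_selection import train_test_split

n_rows = 9983  # rows left after dropping duplicates
items = list(range(n_rows))

# 20% of the rows go to the test split, the rest to the training split.
train_items, test_items = train_test_split(items, test_size=0.2, random_state=1)
print(len(train_items), len(test_items))

# The same seed gives the same split on every run.
_, test_again = train_test_split(items, test_size=0.2, random_state=1)
print(test_again == test_items)
```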

We check the shape of X_train.

In [54]:
X_train.shape

(7986, 1)

We inspect X_test with '.info()'.

In [55]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1997 entries, 5333 to 2573
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  1997 non-null   object
dtypes: object(1)
memory usage: 31.2+ KB


Applying Bag-of-Words (BoW)

Here, we use scikit-learn's 'CountVectorizer' class to convert text data into a bag-of-words representation. 
This code initializes a 'CountVectorizer', fits it to the training data (X_train['review']), transforms both the training and testing data into bag-of-words representations, and then checks the shape of the training data with X_train_bow.shape.

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

In [57]:
cv= CountVectorizer()

In [58]:
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [59]:
X_train_bow.shape

(7986, 48282)
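
What the vectorizer produces is easier to see on a toy corpus. The sketch below uses three made-up documents; each column of the result is one vocabulary term and each cell is a word count.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good good film"]
cv_demo = CountVectorizer()
bow = cv_demo.fit_transform(docs)  # SciPy sparse matrix

# vocabulary_ maps each term to its column index; sort the terms by that index.
vocab = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(vocab)
print(bow.toarray())
```

Calling '.toarray()' on the full data creates a dense 7986 x 48282 matrix, which uses a lot of memory. Some estimators, such as RandomForestClassifier, accept the sparse matrix directly, whereas GaussianNB needs dense input.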

Now, we use scikit-learn's Gaussian Naive Bayes classifier ('GaussianNB'). 
This code initializes the classifier as 'gnb' and fits it to the bag-of-words representation of the training data ('X_train_bow') with the corresponding labels ('y_train').

In [60]:
from sklearn.naive_bayes import GaussianNB
gnb= GaussianNB()
gnb.fit(X_train_bow,y_train)

This code computes predictions ('y_pred') with the trained Gaussian Naive Bayes model on the test data ('X_test_bow') and then computes the accuracy using 'accuracy_score' from scikit-learn.

In [61]:
y_pred=gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, y_pred)

0.6324486730095142

We print the confusion matrix, a table showing the number of true negative, false positive, false negative, and true positive predictions. It is a useful tool for evaluating the performance of a classification model. In scikit-learn's output, rows correspond to the true labels and columns to the predicted labels.


In [63]:
confusion_matrix(y_test, y_pred)

array([[717, 235],
       [499, 546]], dtype=int64)
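
The accuracy reported earlier can be recovered from this matrix, along with precision and recall for the positive class. A short worked sketch using the four counts from the output above:

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
cm = [[717, 235],
      [499, 546]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of the reviews predicted positive, the share that were positive
recall = tp / (tp + fn)     # of the positive reviews, the share that were found
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The low recall shows that this model misses almost half of the positive reviews, which the single accuracy number hides.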

Next, we train a random forest classifier on the same bag-of-words features:
1. create a 'RandomForestClassifier' instance ('rf=RandomForestClassifier()').
2. fit it to the training data and predict on the test data with 'rf.predict'.
3. calculate the accuracy using 'accuracy_score'.

In [66]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()

In [67]:
rf.fit(X_train_bow, y_train)
y_pred=rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

0.8517776664997496

We limit the number of features in 'CountVectorizer' and then train a RandomForestClassifier again.
1. set the 'max_features' parameter of 'CountVectorizer' to 3000, which keeps only the 3,000 most frequent terms.
2. fit the vectorizer to the training data and transform both splits.
3. train the classifier, compute 'y_pred', and evaluate the accuracy.

In [68]:
cv=CountVectorizer(max_features=3000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf=RandomForestClassifier()
rf.fit(X_train_bow, y_train)
y_pred=rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

0.8377566349524287

This code uses a 'CountVectorizer' with n-grams and a limit on the number of features, and it trains a RandomForestClassifier on the resulting bag-of-words representation of the training data. The accuracy of the model is then evaluated on the test data.
1. set the 'max_features' parameter to 5000.
2. specify the 'ngram_range' parameter as '(1,3)' to include unigrams, bigrams, and trigrams.

In [69]:
cv=CountVectorizer(ngram_range=(1,3),max_features=5000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf=RandomForestClassifier()
rf.fit(X_train_bow, y_train)
y_pred=rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)

0.8437656484727091
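
To illustrate what 'ngram_range=(1,3)' extracts, the sketch below fits a vectorizer on a single made-up three-word document and lists the resulting features.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv_ngrams = CountVectorizer(ngram_range=(1, 3))
cv_ngrams.fit(["not good movie"])

# Three unigrams, two bigrams, and one trigram.
features = sorted(cv_ngrams.vocabulary_)
print(features)
```

N-grams such as "not good" keep some word-order information that single words lose.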

# Using TF-IDF

This code initializes a 'TfidfVectorizer', fits it to the training data, and transforms both the training and test data into TF-IDF representations.
1. the 'fit_transform' method fits the vectorizer and transforms the training data, stored in 'X_train_tfidf'.
2. the 'transform' method transforms the test data with the already fitted vectorizer, stored in 'X_test_tfidf'.

In [71]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf= TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()
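
The difference from plain counts can be seen on two made-up documents: a term that appears in every document gets a lower weight than a term that appears in only one. A minimal sketch with scikit-learn's default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie"]
tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(docs).toarray()

col = tfidf_demo.vocabulary_  # term -> column index
# "movie" appears in both documents, "good" only in the first one.
print(weights[0, col["good"]], weights[0, col["movie"]])
```

By default each row is also scaled to unit length, so the weights are comparable across short and long reviews.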

This code initializes a RandomForestClassifier, fits it to the training data with TF-IDF features ('X_train_tfidf'), predicts labels on the test data with TF-IDF features ('X_test_tfidf'), and calculates the accuracy of the model.

In [73]:
rf=RandomForestClassifier()

rf.fit(X_train_tfidf, y_train)
y_pred=rf.predict(X_test_tfidf)
accuracy_score(y_test, y_pred)

0.8522784176264396

# Gensim

Here,
1. we import the gensim library, a popular Python library for topic modeling and document similarity analysis.
2. we download the 'punkt' tokenizer data from NLTK, which 'sent_tokenize' uses.
3. we import the 'sent_tokenize' function from nltk for sentence tokenization and the 'simple_preprocess' function from gensim for text preprocessing.

These functions are useful for preparing text data for tasks such as natural language processing, topic modeling, or document similarity analysis.

In [74]:
import gensim

In [80]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shwet\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [81]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

Now, we use a nested loop to process the text data:
1. for doc in data['review']: the outer loop iterates over each document in the 'review' column of the DataFrame 'data'.
2. raw_sent= sent_tokenize(doc): inside the outer loop, 'sent_tokenize' from nltk splits the current document ('doc') into sentences and stores them in 'raw_sent'.
3. for sent in raw_sent: the inner loop iterates over each sentence of the current document.
4. story.append(simple_preprocess(sent)): for each sentence, 'simple_preprocess' from gensim performs basic preprocessing, such as lowercasing and tokenization. The resulting list of tokens is appended to the 'story' list.

In summary, this code tokenizes each document into sentences, then tokenizes and preprocesses each sentence with gensim. The preprocessed sentences are stored in the 'story' list.

In [82]:
story=[]
for doc in data['review']:
    raw_sent= sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

We train a Word2Vec model with gensim on the 'story' data. The components of the code are:

1. model=gensim.models.Word2Vec(window=10,min_count=2): this initializes a Word2Vec model with a context window size of 10 ('window=10') and a minimum word frequency of 2 ('min_count=2'). The context window defines the maximum distance between the current and predicted word within a sentence during training, and words that occur fewer than 2 times are ignored.
2. model.build_vocab(story): this builds the vocabulary from the sentences in 'story'. It has to run before training.
3. model.train(story, total_examples=model.corpus_count, epochs=model.epochs): this line trains the Word2Vec model, using the sentences in 'story' as training examples. 'total_examples' is set to 'model.corpus_count' (the total number of sentences), and 'epochs' is set to 'model.epochs' (the number of iterations over the data).
4. len(model.wv.index_to_key): this returns the number of unique words in the trained Word2Vec model. 'index_to_key' is the list of vocabulary words, so its length is the vocabulary size.

In [84]:
model=gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [85]:
model.build_vocab(story)

In [86]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(5876315, 6212140)

In [87]:
len(model.wv.index_to_key)

31845

Here,
1. we define a function 'document_vector(doc)' that takes a single argument 'doc' and calculates the mean vector of the word embeddings for that document.
2. a list comprehension iterates through each word in the input document ('doc') and filters out the words that are not in the Word2Vec model's vocabulary.
3. finally, the function calculates the mean vector of the word embeddings of the remaining words. It uses numpy's 'np.mean' with axis=0 to average over the words, resulting in a single mean vector.


In [90]:
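def document_vector(doc):
    # keep only the words that are in the Word2Vec vocabulary
    # (key_to_index is a dict, so each lookup is fast; index_to_key is a list)
    doc = [word for word in doc.split() if word in model.wv.key_to_index]
    # average the word vectors to get a single vector for the document
    return np.mean(model.wv[doc], axis=0)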
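
The averaging step can be sketched with numpy alone. Here `toy_vectors` is a made-up dictionary of 3-dimensional embeddings standing in for the trained model, and the zero-vector fallback for documents with no known words is an assumption added for illustration, not part of the notebook's function.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(doc, vectors, size=3):
    # Keep only the in-vocabulary words.
    words = [w for w in doc.split() if w in vectors]
    if not words:
        # Assumption: fall back to a zero vector when no word is in the vocabulary.
        return np.zeros(size)
    # Average the word vectors component by component.
    return np.mean([vectors[w] for w in words], axis=0)

known = mean_vector("good unknown movie", toy_vectors)
empty = mean_vector("unknown words only", toy_vectors)
print(known, empty)
```

Without such a guard, a review made up entirely of out-of-vocabulary words would have no vectors to average.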

We calculate the mean vector of word embeddings for the first document in the 'review' column of our DataFrame 'df'.

In [91]:
document_vector(df['review'].values[0])

array([-0.20446354,  0.48086828,  0.04406572,  0.17868163, -0.11360488,
       -0.6648532 ,  0.33844563,  1.0647515 , -0.17490748, -0.3658816 ,
       -0.2453849 , -0.5312085 , -0.01779767,  0.22050327, -0.05921424,
       -0.13726705,  0.07167196, -0.2636031 ,  0.0520041 , -0.67296475,
        0.12792939,  0.28691745,  0.32737443, -0.2829517 , -0.3757647 ,
       -0.07819744, -0.3553448 , -0.03684964, -0.484629  ,  0.20674957,
        0.5017143 , -0.07333556,  0.3365158 , -0.3449094 , -0.23993956,
        0.48830262,  0.10382417, -0.55608785, -0.32816786, -0.907156  ,
        0.08772528, -0.3460025 , -0.00423646, -0.11687607,  0.42869237,
       -0.20204553, -0.31456494, -0.27588758,  0.2633517 ,  0.38322195,
        0.21068761, -0.43524468, -0.3621421 , -0.13118613, -0.22107878,
        0.08333623,  0.16229126, -0.17648011, -0.44074973,  0.09028256,
        0.12995642, -0.06009729,  0.1622938 , -0.1013926 , -0.5386657 ,
        0.45121804, -0.01630789,  0.20510393, -0.5400061 ,  0.36

The 'tqdm' library is used to create progress bars in Python, making it easier to visualize the progress of an iteration or a task. The name 'tqdm' comes from the Arabic word 'taqaddum', which means 'progress'.

In [92]:
from tqdm import tqdm

We use 'tqdm' to track the progress of processing the documents and appending their vectors to a list 'X' with the 'document_vector' function. This is good practice when dealing with large datasets or time-consuming operations.
1. for doc in tqdm(data['review'].values): this loop iterates over each document in the 'review' column of our dataset 'data'. 'tqdm' wraps the iterable to display a progress bar.
2. X.append(document_vector(doc)): for each document, the 'document_vector' function is called to calculate the mean vector of word embeddings. The resulting vector is then appended to the list 'X'.

In [93]:
X= []
for doc in tqdm(data['review'].values):
    X.append(document_vector(doc))

100%|██████████| 9983/9983 [11:00<00:00, 15.11it/s]


Here, the first cell converts the list 'X' into a numpy array with np.array(X),
and the second displays the shape of the resulting array.

In [95]:
X=np.array(X)

In [96]:
X.shape

(9983, 100)

Now, we import 'LabelEncoder' to convert the sentiments from 'positive' and 'negative' into the numerical values 1 and 0.
The encoded sentiments are assigned to 'y'.

In [97]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
y= encoder.fit_transform(data['sentiment'])

Display 'y'.

In [98]:
y

array([1, 1, 1, ..., 0, 0, 1])

With 'X' and 'y' in numerical form, we split the data into training and test sets using 'train_test_split' from scikit-learn.

In [99]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=1)

Here, we import 'GaussianNB' from 'sklearn.naive_bayes' and 'accuracy_score' from 'sklearn.metrics'.

In [100]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

Finally, we fit the model, predict the labels for 'X_test' into 'y_pred', and calculate the accuracy score using 'accuracy_score' from 'sklearn.metrics'.

In [101]:
# named gnb because this is a Gaussian (not multinomial) Naive Bayes model
gnb=GaussianNB()
gnb.fit(X_train, y_train)
y_pred=gnb.predict(X_test)
accuracy_score(y_test, y_pred)

0.7155733600400601

# Conclusion

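In this NLP classification project, the final model, Gaussian Naive Bayes on averaged Word2Vec document vectors, achieved an accuracy of approximately 71.56%. Among the earlier experiments, the random forest classifiers on bag-of-words and TF-IDF features scored higher, at roughly 84% to 85%. These results show that the models can classify reviews into the predefined sentiment categories. However, it is essential to interpret these accuracy scores in the context of the specific objectives and requirements of the project.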