**Import Libraries**

In [0]:
import pandas as pd
from bs4 import BeautifulSoup as soup
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Load Dataset**
> Here, I am loading the concatenated dataset from the previous notebook so it's easy to process the data all together.

In [0]:
df = pd.read_csv("/content/drive/My Drive/Drug Reviews Dataset/DrugReviews.csv")

**Clean Data**

From the previous notebook, we know that there are 1194 NULL values in the *condition* column. Let's drop those rows.

In [0]:
df = df.dropna()
df = df.reset_index(drop=True)
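
One thing worth knowing about the cell above: `dropna()` with no arguments drops a row if *any* column is missing, not just *condition*. If *condition* is the only column with NULLs, the result is the same. The sketch below uses a made-up toy DataFrame (not the real dataset) to show the difference when you restrict the check with `subset`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "condition": ["Acne", None, "Pain"],
    "review": ["worked well", "no effect", None],
})

# No arguments: a row is dropped if ANY column has a missing value.
dropped_any = toy.dropna().reset_index(drop=True)

# subset=: only the listed columns are checked for missing values.
dropped_condition = toy.dropna(subset=["condition"]).reset_index(drop=True)

print(len(dropped_any), len(dropped_condition))
```

Which variant you want depends on whether rows with other missing fields should survive the cleaning step.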

Now, I'm gonna get rid of the rows whose *condition* values contain `</span>`.

In [0]:
span_list = list()
for i, condition in enumerate(df['condition']):
  if '</span>' in condition:
    span_list.append(i)

In [0]:
len(span_list)

1171

There are 1171 rows in which the *condition* value contains `</span>`. I have fetched the indices of those rows, so now it'll be easy to drop them.

In [0]:
df = df.drop(index=span_list)
df = df.reset_index(drop=True)
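
The loop-and-drop approach works fine. As an alternative, the same filtering can be sketched with a boolean mask, which avoids collecting indices by hand. The toy DataFrame and its values below are made up for illustration; this assumes the column has no missing values left, as is the case after `dropna()`.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up).
toy = pd.DataFrame({
    "condition": ["Acne", "3</span> users found this comment helpful.", "Pain"],
})

# regex=False treats "</span>" as a literal substring, not a pattern.
has_span = toy["condition"].str.contains("</span>", regex=False)

# Keep only the rows where the mask is False, then renumber the index.
cleaned = toy[~has_span].reset_index(drop=True)

print(int(has_span.sum()), len(cleaned))
```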

In [0]:
print(any(['</span>' in c for c in df['condition']]))

False


Now you can see that the *condition* column no longer has any values containing `</span>`.

**Review Pre-Processing**
> For effective sentiment analysis, you need text that can be easily transformed into vectors a computer can work with. To get there, you mostly need to get rid of all the unnecessary characters and keep just the words that matter.

Stop words are the most commonly used words in a language (for example *the*, *is* and *and*). They carry little sentiment on their own, so you might wanna get rid of them, otherwise your vocabulary will be very bulky.

In [0]:
stops = set(stopwords.words('english'))

In [0]:
not_stops = ["aren't","couldn't","didn't","doesn't","don't","hadn't","haven't","isn't","mightn't","mustn't","needn't","no","nor","not","shan't","shouldn't","wasn't","weren't","won't","wouldn't"]

In [0]:
# Remove the negation words from the stop word set so they are kept in the reviews
stops.difference_update(not_stops)

I have removed the negation words (stored in *not_stops*) from the stop word set, so they stay in the reviews. These words can flip the meaning of a sentence, which makes them essential in sentiment analysis.
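
To see why this matters, here is a minimal sketch with a tiny hand-written stop word set. The set `toy_stops` and the sample sentence are made up for illustration and are not NLTK's actual list.

```python
# Toy stop word set for illustration only; the notebook uses NLTK's English list.
toy_stops = {"this", "did", "not", "the", "was", "for", "me"}
keep = {"not"}  # negation words we want to survive filtering

tokens = "this drug did not work for me".split()

# Filtering with the full stop set throws the negation away.
without_negation = [t for t in tokens if t not in toy_stops]

# Filtering with the negation removed from the stop set keeps it.
with_negation = [t for t in tokens if t not in (toy_stops - keep)]

print(without_negation)
print(with_negation)
```

With the negation dropped, a negative review ends up looking like a positive one, which is exactly what you don't want when training a sentiment model.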

Stemming:
> Stemming is the process of reducing inflected or derived words to their root form. Again, this helps keep the vocabulary less bulky. I'll tell you how. Consider the words *clean*, *cleaning* and *cleaned*. These three words share the same core meaning and differ only in their form (tense). Without stemming, each of them would be a separate entry in the vocabulary with its own representation. With stemming, you can map all three to a single representation (say, *clean*). In this way, a bunch of similar words is represented by their common root.

You can use the built-in Snowball Stemmer from the NLTK library to do the job.

In [0]:
stemmer = SnowballStemmer('english')
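
As a quick check of the idea, the sketch below runs the three example words through the stemmer and compares the vocabulary size before and after. The word list is just the example from the text above.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ["clean", "cleaning", "cleaned"]
stems = [stemmer.stem(w) for w in words]

# Unique entries before and after stemming.
vocab_before = set(words)
vocab_after = set(stems)

print(stems)
print(len(vocab_before), len(vocab_after))
```

Keep in mind that a stem is not guaranteed to be a dictionary word. It only needs to be the same string for all related forms.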

In [0]:
def clean_reviews(raw_text):

  # As the data has been scraped, it will have some HTML. We will get rid of that using BeautifulSoup.
  text = soup(raw_text, 'html.parser').get_text()

  # Replace everything except letters, digits and apostrophes with a space.
  # Apostrophes are kept so that contractions like "don't" stay intact.
  punct_removed = re.sub(r"[^a-zA-Z0-9']", ' ', text)

  # Convert all words to lower case
  words = punct_removed.lower().split()

  # Fetch only meaningful words
  required_words = [word for word in words if word not in stops]

  # Stem similar words
  stemmed = [stemmer.stem(rw) for rw in required_words]
  
  # Return the cleaned text
  return " ".join(stemmed)
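
To make the punctuation and lower-casing steps concrete, here is a small sketch that runs just those two steps on a made-up sample sentence (the sentence is hypothetical, not taken from the dataset). It only needs the standard library.

```python
import re

sample = "It didn't work... 10/10 would NOT recommend!"

# Same character class as in clean_reviews: anything that is not a letter,
# digit or apostrophe becomes a space.
punct_removed = re.sub(r"[^a-zA-Z0-9']", " ", sample)

# Lower-case and split on whitespace to get the tokens.
tokens = punct_removed.lower().split()

print(tokens)
```

Note that the contraction *didn't* stays in one piece because the apostrophe is kept, and the numbers survive as separate tokens.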

Apply the *clean_reviews* function to the *review* column and store the cleaned reviews in a new column, *cleaned_reviews*.

In [0]:
df['cleaned_reviews'] = df['review'].apply(clean_reviews)

Finally, I save the two columns *cleaned_reviews* and *sentiment* as a CSV file and copy it to my Drive, so I can build a Deep Learning Model with Word Embeddings in the next notebook.

In [0]:
df[['cleaned_reviews','sentiment']].to_csv('DrugReviewsSentimentAnalysis.csv')
!cp DrugReviewsSentimentAnalysis.csv "drive/My Drive/Drug Reviews Dataset/"
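
A small detail to be aware of when loading this file later: by default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column in `read_csv`. The sketch below shows this with a made-up one-row DataFrame and an in-memory buffer instead of a real file. Passing `index=False` is one way to avoid it, if you don't need the index.

```python
import io
import pandas as pd

# Toy frame with the same two column names (values are made up).
toy = pd.DataFrame({"cleaned_reviews": ["drug not work"], "sentiment": [0]})

# Default behaviour: the index is written as an unnamed first column.
buf_default = io.StringIO()
toy.to_csv(buf_default)
buf_default.seek(0)
cols_default = list(pd.read_csv(buf_default).columns)

# index=False: only the two data columns are written.
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_default)
print(cols_no_index)
```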