### E-Commerce Clothing Reviews

In this notebook we are going to prepare the data for a text classification model that we will build with TensorFlow in the notebook that follows. We are going to use the [womens-ecommerce-clothing-reviews](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) dataset obtained from Kaggle.



### Mounting the Drive

We are going to mount the drive because we need to load the data from our Google Drive.

In [1]:
from google.colab import drive, files

drive.mount("/content/drive")

Mounted at /content/drive


### Basic imports
In the following code cell we are going to import basic packages that we are going to use throughout this notebook.

In [2]:
import os
import re
import numpy as np
import pandas as pd
import nltk
import random

nltk.download("punkt")
nltk.download("words")

from prettytable import PrettyTable

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


('2.8.0', '2.8.0')

### Paths

Defining paths where our data is located.

In [5]:
base_path = "/content/drive/My Drive/NLP Data/E-Commerce Reviews/"

assert os.path.exists(base_path), "Path does not exist."

In [6]:
data_path = os.path.join(base_path, 'data.csv')

### Dataframe
In the following code cell we are going to read our `csv` file using `pandas`.

In [7]:
dataframe = pd.read_csv(data_path)
dataframe.head(2)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses


### Checking the size of our dataset
In the following code cell we are going to check how many examples we have in this dataset before removing `null` values.

In [66]:
len(dataframe)

23486

Our dataset contains `null` values, which we need to drop. In the following code cell we are going to drop every row of this dataframe that has a `null` value in any column.

In [69]:
dataframe = dataframe.dropna()

### Checking the size of the dataset after dropping `NA` values

In the following code cell we are going to check how many examples we are left with in this dataset after removing the rows with `null` values.

In [70]:
# Checking how many examples in this dataframe after dropping NA values
len(dataframe)

19662
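
Because `dropna()` removes a row when any column is `null`, reviews that only lack a `Title` are dropped as well, even though `Title` is not used later. The sketch below uses a made-up toy dataframe (not the real data) to show how per-column null counts and the `subset` argument could be used to keep such rows. It is an illustration of an alternative, not what this notebook does.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Title": [None, "Great dress", None],
    "Review Text": ["Absolutely wonderful", "Love it", None],
    "Rating": [4, 5, 3],
    "Recommended IND": [1, 1, 0],
    "Positive Feedback Count": [0, 4, 1],
})

# Null count per column shows which columns are responsible for lost rows.
null_counts = toy.isna().sum()

# dropna() removes a row if ANY column is null.
dropped_any = toy.dropna()

# Restricting the check to the columns used later keeps rows that only lack a title.
used_columns = ["Review Text", "Rating", "Recommended IND", "Positive Feedback Count"]
dropped_subset = toy.dropna(subset=used_columns)

print(int(null_counts["Title"]), len(dropped_any), len(dropped_subset))
```

On the toy frame, plain `dropna()` keeps one row while the `subset` variant keeps two.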

### Features and labels

We are going to create a model that takes in two features and outputs two labels.

### Features

```
texts: String variable for the review body.
feed_count: Non-negative integer documenting the number of other customers who found this review positive.
```

### Labels

```
recommended: 👍👎
    Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
rattings: ⭐⭐⭐⭐⭐
    Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
```

In [71]:
texts = dataframe['Review Text'].values
feed_count = dataframe['Positive Feedback Count'].values

recommended = dataframe['Recommended IND'].values
rattings = dataframe['Rating'].values

### Text Cleaning

In the following code cells we will be cleaning our text.

In [24]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
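
The order of the substitutions in `decontracted` matters: the specific rules for `won't` and `can't` must run before the general `n't` rule, otherwise `won't` would become `wo not`. The sketch below restates the same rules as an ordered table and shows one caveat, namely that the `'s` rule also rewrites possessives. The names `CONTRACTIONS` and `expand_contractions` are hypothetical and only used for illustration.

```python
import re

# Same substitutions as decontracted, written as an ordered table.
# The specific patterns come first so the general "n't" rule cannot pre-empt them.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(phrase: str) -> str:
    # Apply every rule in order; each rule sees the output of the previous one.
    for pattern, replacement in CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand_contractions("i'm sure it won't fit"))
print(expand_contractions("the dress's fabric"))  # the possessive is expanded too
```

The second call turns `dress's` into `dress is`, which is worth keeping in mind when reading the cleaned reviews.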

In [73]:
# Build the English vocabulary once instead of on every call
ENGLISH_WORDS = set(nltk.corpus.words.words())

def clean_sentence(sent: str) -> str:
  sent = re.sub(r'https?\S+', ' ', sent, flags=re.MULTILINE) # remove URLs
  sent = re.sub(r'\d', ' ', sent) # remove digits
  sent = re.sub(r'[^\w\s\']', ' ', sent) # remove punctuation except "'" in words like I'm
  sent = re.sub(r'\s+', ' ', sent).strip() # collapse repeated whitespace
  # expand contractions, e.g. "i'm" -> "i am"
  words = [decontracted(word) for word in sent.split(' ')]
  # keep English words; tokens that are not purely alphabetic (such as expanded contractions) pass through
  return " ".join(w for w in words if w.lower() in ENGLISH_WORDS or not w.isalpha())
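
The final filter keeps a token only if it is in the vocabulary or is not purely alphabetic, so any alphabetic word missing from the word list is silently dropped. The sketch below replays the same steps with a tiny hypothetical vocabulary (`vocab` stands in for the NLTK word list, and `keep_token` is an illustrative helper). Under the assumption that an inflected form such as "hopes" is absent from the list, the word disappears from the sentence.

```python
import re

# Hypothetical stand-in for set(nltk.corpus.words.words()); the real list is far larger.
vocab = {"i", "had", "such", "high", "hope", "for", "this", "dress"}

def keep_token(token: str) -> bool:
    # Same rule as the last line of clean_sentence: keep vocabulary words,
    # and keep anything that is not purely alphabetic.
    return token.lower() in vocab or not token.isalpha()

sentence = "I had such high hopes for this dress!!! 100%"
sentence = re.sub(r"\d", " ", sentence)           # digits -> spaces
sentence = re.sub(r"[^\w\s']", " ", sentence)     # punctuation -> spaces
sentence = re.sub(r"\s+", " ", sentence).strip()  # collapse whitespace

filtered = " ".join(t for t in sentence.split(" ") if keep_token(t))
print(filtered)
```

This may explain gaps in some cleaned reviews; checking a few cleaned sentences against their originals is a cheap way to see how aggressive the filter is.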

In [76]:
cleaned_texts = list()

for sent in texts:
  cleaned_texts.append(clean_sentence(sent))

### Saving a new csv file with clean text.

In [78]:
assert len(cleaned_texts) == len(feed_count) == len(recommended) == len(rattings), "Features and labels must have the same size(s)"

In [80]:
columns = ["text", "upvotes", "recommended", "rating"]
output_df = pd.DataFrame(list(zip(
    cleaned_texts, feed_count, recommended, rattings
)), columns=columns)

In [82]:
output_df.head(5)

Unnamed: 0,text,upvotes,recommended,rating
0,I had such high for this dress and really it t...,0,0,3
1,I love love love this it is fun flirty and fab...,0,1,5
2,This shirt is very flattering to all due to th...,6,1,5
3,I love reese but this one is not for the very ...,4,0,2
4,I this in my basket at last to see what it wou...,1,1,5


In the following code cell we are going to save the cleaned data as a new `.csv` file based on the above dataframe.

In [83]:
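# index=False keeps the row index out of the file, so no "Unnamed: 0" column appears when it is read back
output_df.to_csv(os.path.join(base_path, 'clean_data.csv'), index=False)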
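
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below shows the round trip on a made-up toy dataframe, writing to an in-memory buffer rather than the Drive path, and compares the default with `index=False`.

```python
import io
import pandas as pd

# Toy frame with made-up values, standing in for output_df.
toy = pd.DataFrame({"text": ["nice dress", "too small"], "rating": [5, 2]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the named columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_clean)
```

Alternatively, a file that already contains the index column can be read with `pd.read_csv(path, index_col=0)`.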