### E-Commerce Clothing Reviews

In this notebook we prepare the data for a text classification model that we will build with TensorFlow in the notebook that follows. We use the [womens-ecommerce-clothing-reviews](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) dataset from Kaggle.



### Mounting the Drive

We are going to mount the drive because we need to load the data from our Google Drive.

In [1]:
from google.colab import drive, files
drive.mount("/content/drive")

Mounted at /content/drive


Next we are going to install the `helperfns` package, which contains helper functions for machine learning.

In [2]:
!pip install helperfns -q

### Basic imports
In the following code cell we are going to import basic packages that we are going to use throughout this notebook.

In [3]:
import os
import re
import numpy as np
import pandas as pd
import nltk
import random

from helperfns.tables import tabulate_data
from helperfns.text import de_contract

nltk.download("punkt")
nltk.download("words")

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### Paths

Defining paths where our data is located.

In [4]:
base_path = "/content/drive/My Drive/NLP Data/E-Commerce Reviews/"

assert os.path.exists(base_path), "Path does not exist."

In [5]:
data_path = os.path.join(base_path, 'data.csv')

### Dataframe
In the following code cell we are going to read our `csv` file using `pandas`.

In [6]:
dataframe = pd.read_csv(data_path)
dataframe.drop(columns=['Unnamed: 0'], inplace=True)
dataframe.head(2)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
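
The `Unnamed: 0` column dropped above appears to be a row index that was written into the CSV when it was saved. As an alternative to dropping it after loading, the following minimal sketch reads it as the index directly with `index_col=0`. It uses a small made-up CSV string (an assumption for illustration) rather than the real `data.csv`.

```python
import io
import pandas as pd

# Made-up CSV mimicking a file saved with its index (first header field is empty).
csv_text = ",Clothing ID,Age,Rating\n0,767,33,4\n1,1080,34,5\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index avoids the extra column entirely.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Either approach leaves the same feature columns; `index_col=0` simply skips the separate `drop` call.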


### Checking the size of our dataset
In the following code cell we are going to check how many examples we have in this dataset before removing null values.

In [7]:
len(dataframe)

23486

To clean our data, we first decide which columns we are interested in for our model. The input columns are called feature columns. We are only interested in two of them:

1. `Review Text`
2. `Positive Feedback Count`

And our label columns will be:

1. `Rating` - an integer from `1` (worst) to `5` (best).
2. `Recommended IND` - whether the customer recommends the product: `1` for recommended and `0` for not recommended.


From our dataframe we need to drop all the columns that we don't care about, which are:

1. `Clothing ID`
2. `Age`
3. `Title`
4. `Division Name`
5. `Department Name`
6. `Class Name`

In [8]:
dataframe.drop(columns=[c for c in dataframe.columns if c not in ['Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count']], inplace=True)
dataframe.head()

Unnamed: 0,Review Text,Rating,Recommended IND,Positive Feedback Count
0,Absolutely wonderful - silky and sexy and comf...,4,1,0
1,Love this dress! it's sooo pretty. i happene...,5,1,4
2,I had such high hopes for this dress and reall...,3,0,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0
4,This shirt is very flattering to all due to th...,5,1,6


Next we are going to check whether we have any null values in our dataset.

In [9]:
dataframe.isna().any()

Review Text                 True
Rating                     False
Recommended IND            False
Positive Feedback Count    False
dtype: bool
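
`isna().any()` only says whether a column has at least one null. To see how many nulls each column has, `isna().sum()` can be used instead. The sketch below shows this on a small made-up dataframe (the values are assumptions for illustration, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four reviews are missing.
toy = pd.DataFrame({
    "Review Text": ["Nice fit", np.nan, "Too small", None],
    "Rating": [5, 3, 2, 4],
})

# Summing the boolean mask counts the null entries per column.
null_counts = toy.isna().sum()
print(null_counts)
```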

Our dataset contains null values in the `Review Text` column, which we need to drop. In the following code cell we are going to drop all the rows that contain a `null` value.

In [10]:
dataframe.dropna(inplace=True)

### Checking the size of the dataset after dropping null values

In the following code cell we are going to check how many examples we are left with in this dataset after removing the rows with `null` values.

In [11]:
# Checking how many examples in this dataframe after dropping NA values
len(dataframe)

22641
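
Comparing the two sizes shows how much data was lost by dropping null values. A minimal sketch of the arithmetic, using the two counts printed above:

```python
rows_before = 23486  # size before dropping null values
rows_after = 22641   # size after dropping null values

dropped = rows_before - rows_after
fraction = dropped / rows_before

print(f"Dropped {dropped} rows ({fraction:.1%} of the original)")
```

So 845 rows, roughly 3.6% of the dataset, had no review text.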

### Features and labels

We are going to create a model that takes in two features and outputs two labels.

### Features

```
text: String variable for the review body.
upvotes: Positive Integer documenting the number of other customers who found this review positive.
```

### Labels

```
recommended:
    👍👎
    Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
rating: ⭐⭐⭐⭐⭐
    Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
```

In the following code cell we are going to rename the columns of our dataset.

In [12]:
columns_map = {'Review Text': "text", 'Positive Feedback Count': "upvotes", 'Recommended IND': "recommended", 'Rating': "rating"}
dataframe.rename(columns=columns_map, inplace=True)
dataframe.head()

Unnamed: 0,text,rating,recommended,upvotes
0,Absolutely wonderful - silky and sexy and comf...,4,1,0
1,Love this dress! it's sooo pretty. i happene...,5,1,4
2,I had such high hopes for this dress and reall...,3,0,0
3,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0
4,This shirt is very flattering to all due to th...,5,1,6
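
With the columns renamed, the two label columns can be inspected with `value_counts` to see how they are distributed. The sketch below uses a small made-up dataframe (the values are assumptions for illustration), so the shares it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up labels using the renamed column names.
toy = pd.DataFrame({
    "rating": [5, 4, 5, 3, 1, 5],
    "recommended": [1, 1, 1, 0, 0, 1],
})

# normalize=True returns shares instead of raw counts.
rating_share = toy["rating"].value_counts(normalize=True).sort_index()
recommended_share = toy["recommended"].value_counts(normalize=True).sort_index()

print(rating_share)
print(recommended_share)
```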


### Text Cleaning

In the following code cells we will be cleaning our text. For text cleaning we are going to apply two functions:

1. `de_contract`, imported from `helperfns.text`
2. `clean_sentence`, defined in the next code cell


In [15]:
def clean_sentence(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove HTML tags (before punctuation removal, which would strip the angle brackets)
    text = re.sub(r'<.*?>', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
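
A quick way to sanity-check regex steps like these is to run them on a made-up sentence. The sketch below uses a hypothetical `strip_markup_and_punct` function with the same patterns, applying the tag pattern before the punctuation pattern so that the angle brackets are still there to match. The sample sentence and URL are assumptions for illustration.

```python
import re

def strip_markup_and_punct(text):
    # Hypothetical demo function; same patterns as the cleaning step.
    text = re.sub(r'http\S+', '', text)        # URLs
    text = re.sub(r'<.*?>', '', text)          # HTML tags, while '<' and '>' still exist
    text = re.sub(r'\d+', '', text)            # numbers
    text = re.sub(r'[^\w\s]', '', text)        # punctuation
    return re.sub(r'\s+', ' ', text).strip()   # extra whitespace

sample = "Love it!<br/> Bought 2 at http://example.com/shop today."
cleaned = strip_markup_and_punct(sample)
print(cleaned)
```

If punctuation were stripped before the tags, `<br/>` would be reduced to the stray word `br` instead of being removed.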

In [16]:
# Expand contractions first, while the apostrophes are still in the text
dataframe['text'] = dataframe['text'].apply(de_contract)
dataframe['text'] = dataframe['text'].apply(clean_sentence)
dataframe.head()

Unnamed: 0,text,rating,recommended,upvotes
0,Absolutely wonderful silky and sexy and comfor...,4,1,0
1,Love this dress it is sooo pretty i happened t...,5,1,4
2,I had such high hopes for this dress and reall...,3,0,0
3,I love love love this jumpsuit it is fun flirt...,5,1,0
4,This shirt is very flattering to all due to th...,5,1,6
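
The expanded forms in the output above (for example `it is`) rely on contractions being expanded while the apostrophe is still in the text. The sketch below illustrates why the order matters. It uses a tiny hypothetical `expand` helper as a stand-in for `de_contract`; it is not the `helperfns` implementation, whose internals are not shown here.

```python
import re

# Hypothetical stand-in for a contraction expander (not the helperfns code).
CONTRACTIONS = {"it's": "it is", "don't": "do not"}

def expand(text):
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def strip_punct(text):
    # Same punctuation pattern as the cleaning step.
    return re.sub(r"[^\w\s]", "", text)

review = "it's sooo pretty, don't miss it!"

expand_first = strip_punct(expand(review))
strip_first = expand(strip_punct(review))

print(expand_first)
print(strip_first)
```

Stripping punctuation first turns `it's` into `its`, which the expander no longer recognises.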


### Saving a new csv file with clean text.

In the following code cell we are going to save the cleaned dataframe above as a new `.csv` file.

In [20]:
dataframe.to_csv(os.path.join(base_path, 'clean_data.csv'), index=False)
print('Saved!')

Saved!
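
Passing `index=False` keeps the row index out of the file, so no extra `Unnamed: 0` column appears when the file is read back. A minimal sketch of that round trip, using a made-up dataframe and a temporary directory (both assumptions for illustration) instead of the Drive path:

```python
import os
import tempfile
import pandas as pd

# Made-up cleaned rows with the renamed columns.
toy = pd.DataFrame({
    "text": ["Love this dress", "Runs small"],
    "rating": [5, 3],
    "recommended": [1, 0],
    "upvotes": [4, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_data.csv")
    toy.to_csv(out_path, index=False)  # do not write the index as a column
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```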
