# Table of Contents

1.  [Imports](#scrollTo=8824f40a)
2.  [Load Data](#scrollTo=5e4cf367)
3.  [Investigate Data Frame](#scrollTo=k1qgOtAfX6Dd)
4.  [Investigate Features](#scrollTo=81dbe88e)
5.  [Download Data](#scrollTo=z98_QqQf0Log&line=1&uniqifier=1)



# 1 Imports

In [None]:
#!pip install session-info

In [None]:
import pandas as pd
import numpy as np
import re, string
import warnings
import json
# For Visualizations
import seaborn as sns
import matplotlib.pyplot as plt 

# For Textual Data
from nltk.tokenize import word_tokenize # Tokenize
from nltk.stem.porter import PorterStemmer #Stemming 
from nltk.corpus import stopwords # To remove stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.feature_extraction.text import CountVectorizer

# See requirements
import session_info

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 2 Load Data

**NOTE:** The primary dataset that you will be using consists of 10,000 rows, each detailing a variety of characteristics of products sold by Online Retail. As you will see, the dataset has both continuous and categorical data, as well as a handful of data elements that could provide illuminating detail when combined with text parsing and string manipulation [*from project guideline*]. The data is available from this [source](https://drive.google.com/file/d/1avfTLZPgTlTpg1GpAFu8K0Opydz8h8ul/view).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Define a custom parser for the 'sellers' column to load the JSON values directly
def CustomParser(data):
    if data:
        data = data.replace('=>',':')
        j1 = json.loads(data)
        return j1
    else:
        return ''

df = pd.read_csv('/content/drive/MyDrive/walmart c1 bootcamp/online_retail_co_ecommerce.csv', index_col=0,  
                 converters={'sellers':CustomParser})

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
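
The converter above relies on the `sellers` strings using `=>` where JSON expects `:`. A minimal sketch of that conversion on a made-up sample string follows; the seller name, the `Seller_price_1` key and the function name `parse_sellers` are illustrative assumptions, not values taken from the dataset.

```python
import json

def parse_sellers(raw):
    """Turn a '=>'-style string into Python objects; empty input gives ''."""
    if not raw:
        return ''
    # Caveat: this also rewrites any '=>' that happens to appear inside a value.
    return json.loads(raw.replace('=>', ':'))

# Hypothetical sample in the same shape as the 'sellers' column
sample = '{"seller"=>[{"Seller_name_1"=>"Example Shop", "Seller_price_1"=>"£3.42"}]}'
parsed = parse_sellers(sample)
print(parsed['seller'][0]['Seller_name_1'])
```

Note that in the preview in section 3 the value under `"seller"` is sometimes a list and sometimes a single dictionary, so downstream code may need to handle both shapes.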


# 3 Investigate Dataframe

In [None]:
# Check the shape of data
df.shape

(10000, 17)

In [None]:
# Look at the first two rows
df.head(2)

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers
0,eac7efa5dbd3d667f26eb3d3ab504464,Hornby 2014 Catalogue,Hornby,£3.42,5 new,15,1.0,4.9 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.onlineretail.co/Hornby-R8150-Catalo...,Product Description Hornby 2014 Catalogue Box ...,Technical Details Item Weight640 g Product Dim...,Product Description Hornby 2014 Catalogue Box ...,http://www.onlineretail.co/Hornby-R8150-Catalo...,Does this catalogue detail all the previous Ho...,Worth Buying For The Pictures Alone (As Ever) ...,"{""seller""=>[{""Seller_name_1""=>""OnlineRetail.co..."
1,b17540ef7e86e461d37f3ae58b7b72ac,FunkyBuys® Large Christmas Holiday Express Fes...,FunkyBuys,£16.99,,2,1.0,4.5 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.onlineretail.co/Christmas-Holiday-E...,Size Name:Large FunkyBuys® Large Christmas Hol...,Technical Details Manufacturer recommended age...,Size Name:Large FunkyBuys® Large Christmas Hol...,http://www.onlineretail.co/Christmas-Holiday-E...,can you turn off sounds // hi no you cant turn...,Four Stars // 4.0 // 18 Dec. 2015 // By\n \...,"{""seller""=>{""Seller_name_1""=>""UHD WHOLESALE"", ..."


In [None]:
# Check missing values
df.isna().sum()

uniq_id                                           0
product_name                                      0
manufacturer                                      7
price                                          1435
number_available_in_stock                      2500
number_of_reviews                                18
number_of_answered_questions                    765
average_review_rating                            18
category_and_sub_category                       690
customers_who_bought_this_item_also_bought     1062
description                                     651
product_information                              58
product_description                             651
items_customers_buy_after_viewing_this_item    3065
customer_questions_and_answers                 9086
customer_reviews                                 21
sellers                                        3082
dtype: int64

# 4 Investigate Features 

**Note:** There are 17 columns in dataframe. In this section, we will investigate features and create new features. Let's start with 'uniq_id', 'product_name', 'manufacturer' features.

In [None]:
df[['uniq_id', 'product_name', 'manufacturer']].describe()

Unnamed: 0,uniq_id,product_name,manufacturer
count,10000,10000,9993
unique,10000,9964,2651
top,eac7efa5dbd3d667f26eb3d3ab504464,Polyhedral Dice,LEGO
freq,1,3,171


## 4.1 price

In [None]:
# Check a sample price
df['price'][0]

'£3.42'

**Note:** The price column contains a £ sign. Also, some rows use a different price format than the others.

In [None]:
# Clean the "price" feature. Remove pound sign.
df['price'] = df['price'].str.replace('£', '')

In [None]:
# Some rows contain a price range such as '21.99 - 27.99'
# These will be replaced with NaN
df.loc[3331, 'price']

'21.99 - 27.99'

In [None]:
# Replace price ranges (values containing "-") with NaN
is_range = df['price'].str.contains('-', regex=False, na=False)
df.loc[is_range, 'price'] = np.nan

# Replace "," with ''
df['price'] = df['price'].str.replace(',', '')

# Change the type of price column
df['price'] = df['price'].astype(float)
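
The price cleaning steps can be tried out on a small toy Series before running them on the full column. This is a sketch under the assumption that prices are strings with an optional `£`, optional thousands separators, and occasional ranges; `clean_price` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_price(prices):
    """Strip '£' and thousands separators; ranges and missing values become NaN."""
    cleaned = (prices.str.replace('£', '', regex=False)
                     .str.replace(',', '', regex=False))
    # A '-' marks a price range, which has no single numeric value
    is_range = cleaned.str.contains('-', regex=False, na=False)
    return pd.to_numeric(cleaned.mask(is_range), errors='coerce')

toy = pd.Series(['£3.42', '£21.99 - 27.99', '£1,234.50', np.nan])
result = clean_price(toy)
print(result)
```

Using `pd.to_numeric(..., errors='coerce')` is a design choice here: any leftover unparseable string becomes NaN instead of raising, which may or may not be what you want.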

In [None]:
# Descriptive statistics of price column
df['price'].describe()

count    8547.000000
mean       20.251406
std        46.314450
min         0.010000
25%         4.990000
50%        10.560000
75%        19.990000
max      2439.920000
Name: price, dtype: float64

In [None]:
print('Number of missing prices:')
df['price'].isna().sum()

Number of missing prices:


1453

## 4.2 number_available_in_stock

In [None]:
# See one of the rows
str(df['number_available_in_stock'][0])

'5\xa0new'

**Note:** This column includes both the stock count and the stock type.

In [None]:
# The 'number_available_in_stock' column contains non-breaking spaces (\xa0).
# Replace them with regular spaces
df['number_available_in_stock'] = df['number_available_in_stock'].str.replace('\xa0', ' ')

In [None]:
# Create new features for the stock count and the stock type
df['number_in_stock'] = df['number_available_in_stock'].str.extract(r'(\d+)')
df['type_of_stock'] = df['number_available_in_stock'].str.replace(r'\d+', '', regex=True)
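
As an alternative sketch, the count and the type could be captured in one pass with named groups. The toy values and the column names `count` and `kind` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.Series(['5\xa0new', '2 used', np.nan])

# \s matches the non-breaking space (\xa0) as well as a regular space,
# so no separate replacement step is needed in this variant
parts = toy.str.extract(r'(?P<count>\d+)\s*(?P<kind>[A-Za-z]+)')
print(parts)
```

With this pattern the type comes out without a leading space, and missing inputs stay missing in both columns.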

In [None]:
# See the types of stock
df['type_of_stock'].value_counts(dropna = False) #Show the number of missing values

 new            7324
NaN             2500
 used            161
 collectible      13
 refurbished       2
Name: type_of_stock, dtype: int64

**Note:** We assume that a missing ['number_in_stock'] means the product is out of stock, so missing values are imputed with 0.

For these rows, 'type_of_stock' is coded as '*No stock*'.

In [None]:
# Fill missing values with 0
df['number_in_stock']= df['number_in_stock'].fillna(0)

# Change the type of this column
df['number_in_stock'] = df['number_in_stock'].astype(int)

# Fill missing values of 'type_of_stock' with a new category
df['type_of_stock'] = df['type_of_stock'].fillna('No stock')

# Remove the leading space and capitalize each category
df['type_of_stock'] = df['type_of_stock'].replace([' new',' used', ' collectible', ' refurbished'],
                                                   ['New', 'Used', 'Collectible', 'Refurbished'])

In [None]:
# See the types of stock
df['type_of_stock'].value_counts()

New            7324
No stock       2500
Used            161
Collectible      13
Refurbished       2
Name: type_of_stock, dtype: int64

## 4.3 number_of_reviews

In [None]:
# the type of the column
print(f"Column type: {df['number_of_reviews'].dtypes}")
print()
print('The number of missing values:')
df['number_of_reviews'].isna().sum()

Column type: object

The number of missing values:


18

In [None]:
# Some numbers include a comma (,). Replace it with ''
df['number_of_reviews'] = df['number_of_reviews'].str.replace(',', '')

# Assumption: a missing df['number_of_reviews'] means there are no reviews for this product.
# Fill missing values with 0
df['number_of_reviews'] = df['number_of_reviews'].fillna(0)

# Change the type
df['number_of_reviews'] = df['number_of_reviews'].astype(int)

In [None]:
# Values of this feature
df['number_of_reviews'].value_counts().sort_index()

0         18
1       4315
2       1427
3        768
4        524
        ... 
649        1
690        1
802        1
1040       1
1399       1
Name: number_of_reviews, Length: 195, dtype: int64

In [None]:
# Which products have more than 1000 reviews?
df[df['number_of_reviews']>1000]

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers,number_in_stock,type_of_stock
8872,c751a76dd7668f78b4222b5547e7249b,TOMY Pop-Up Pirate,Tomy,9.99,59 new,1040,11.0,4.5 out of 5 stars,Characters & Brands > Tomy,http://www.onlineretail.co/Hungry-Hippos-Elefu...,Style Name:Pop-Up-Pirate/T7028 Product Descrip...,Technical Details Brand Tomy Model NumberT7028...,Style Name:Pop-Up-Pirate/T7028 Product Descrip...,,Is this good for 5 and 6 year old kids? // Hi ...,Crazy fun // 4.0 // 9 Sept. 2007 // By\n \n...,"{""seller""=>[{""Seller_name_1""=>""OnlineRetail.co...",59,New
9315,c30905e0fd0f5af84be16f794180559d,Bananagrams Game,Winning Moves,9.98,42 new,1399,1.0,4.7 out of 5 stars,Games > Educational Games,http://www.onlineretail.co/Pairs-in-Pears-Game...,Colour Name:No Manufacturer's Description Bana...,Technical Details Item Weight381 g Product Dim...,Colour Name:No Manufacturer's Description Bana...,http://www.onlineretail.co/Eco-Premium-Life%C2...,,Highly recommended // 5.0 // 16 Jan. 2008 // B...,,42,New


## 4.4 number_of_answered_questions

**Note:**
Assumption: If df['number_of_answered_questions'] is missing, we assume that there are no answered questions for this product. Fill missing values with zero.


In [None]:
# Fill missing values with 0
df['number_of_answered_questions']= df['number_of_answered_questions'].fillna(0)

df['number_of_answered_questions'] = df['number_of_answered_questions'].astype(int)

In [None]:
# Values of this feature
df['number_of_answered_questions'].value_counts().sort_index()

0      765
1     6435
2     1469
3      566
4      235
5      161
6       82
7       46
8        7
9       54
10      11
11      79
12      22
13      21
14       1
17       1
19      13
23      14
28       4
39      14
Name: number_of_answered_questions, dtype: int64

**Note:** 765 products have no answered questions (these are the rows imputed with 0).

## 4.5 average_review_rating

In [None]:
# values of column
df['average_review_rating'].value_counts()

5.0 out of 5 stars    5140
4.0 out of 5 stars    1312
4.5 out of 5 stars     718
4.8 out of 5 stars     567
4.7 out of 5 stars     528
4.3 out of 5 stars     407
4.6 out of 5 stars     392
4.4 out of 5 stars     298
4.9 out of 5 stars     237
4.2 out of 5 stars     217
4.1 out of 5 stars     145
3.9 out of 5 stars       7
3.0 out of 5 stars       3
3.6 out of 5 stars       3
3.5 out of 5 stars       2
3.3 out of 5 stars       2
3.7 out of 5 stars       2
3.8 out of 5 stars       1
2.3 out of 5 stars       1
Name: average_review_rating, dtype: int64

**Note:** All average_review_rating values end with 'out of 5 stars'.

In [None]:
# Extract the average rating (the first three characters) and set the type to float
df['average_review_rating'] = df['average_review_rating'].str[:3].astype(float)

# Let's see the values
df['average_review_rating'].value_counts()

5.0    5140
4.0    1312
4.5     718
4.8     567
4.7     528
4.3     407
4.6     392
4.4     298
4.9     237
4.2     217
4.1     145
3.9       7
3.0       3
3.6       3
3.5       2
3.3       2
3.7       2
3.8       1
2.3       1
Name: average_review_rating, dtype: int64
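
Slicing the first three characters works because every rating shown above has the form `d.d`. A slightly more defensive sketch, assuming the same 'out of 5 stars' suffix, captures the number with a regular expression instead; the toy values are made up.

```python
import pandas as pd

toy = pd.Series(['4.9 out of 5 stars', '5.0 out of 5 stars', None])

# Capture the leading number only when the expected suffix follows it
ratings = toy.str.extract(r'^(\d+(?:\.\d+)?) out of 5 stars', expand=False).astype(float)
print(ratings.tolist())
```

Strings that do not follow the expected pattern come out as NaN rather than as a truncated value.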

## 4.6 category_and_sub_category

In [None]:
df['category_and_sub_category'].value_counts()

Die-Cast & Toy Vehicles > Toy Vehicles & Accessories > Scaled Models > Vehicles    880
Figures & Playsets > Science Fiction & Fantasy                                     485
Arts & Crafts > Children's Craft Kits > Bead Art & Jewellery-Making                368
Characters & Brands > Disney > Toys                                                347
Hobbies > Trading Cards & Accessories > Packs & Sets                               317
                                                                                  ... 
Hobbies > Model Trains & Railway Sets                                                1
Cooking & Dining > Kitchen Tools & Gadgets > Barware > Coasters                      1
Women > Accessories > Eyewear & Accessories > Sunglasses                             1
Hobbies > Model Trains & Railway Sets > Accessories                                  1
Games > Drinking Games                                                               1
Name: category_and_sub_category, Length: 25

In [None]:
# Split category and subcategory
df['category_and_sub_category'].str.split(">",expand=True).head()

Unnamed: 0,0,1,2,3,4
0,Hobbies,Model Trains & Railway Sets,Rail Vehicles,Trains,
1,Hobbies,Model Trains & Railway Sets,Rail Vehicles,Trains,
2,Hobbies,Model Trains & Railway Sets,Rail Vehicles,Trains,
3,Hobbies,Model Trains & Railway Sets,Rail Vehicles,Trains,
4,Hobbies,Model Trains & Railway Sets,Rail Vehicles,Trains,


In [None]:
# Split once and reuse the result
split_categories = df['category_and_sub_category'].str.split(">", expand=True)

# Create category column
df['category'] = split_categories.iloc[:, 0]

# Create sub_category columns
for i in range(1, 5):
    df[f'sub_category{i}'] = split_categories.iloc[:, i]

In [None]:
# Count of distinct main categories
print(f"There are {df['category'].nunique()} distinct main categories.")
# Count of distinct sub-categories
for i in range(1,5):
    print("There are", df[f'sub_category{i}'].nunique(), f"distinct sub{i} categories.")

There are 43 distinct main categories.
There are 165 distinct sub1 categories.
There are 115 distinct sub2 categories.
There are 46 distinct sub3 categories.
There are 4 distinct sub4 categories.


In [None]:
# Take the most frequent categories (note: the slice [:11] returns 11 categories, not 10)
list_top10 =  df['category'].value_counts()[:11].index
list_top10

Index(['Hobbies ', 'Die-Cast & Toy Vehicles ', 'Figures & Playsets ',
       'Characters & Brands ', 'Games ', 'Arts & Crafts ', 'Party Supplies ',
       'Fancy Dress ', 'Sports Toys & Outdoor ', 'Dolls & Accessories ',
       'Jigsaws & Puzzles '],
      dtype='object')
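
The category names in the output above keep a trailing space (for example `'Hobbies '`) because the split is done on `>` alone. A minimal sketch of one way to trim them, shown on made-up toy values:

```python
import pandas as pd

toy = pd.Series(['Hobbies > Model Trains & Railway Sets > Rail Vehicles',
                 'Games > Educational Games'])

parts = toy.str.split('>', expand=True)
# Strip the surrounding whitespace in every resulting column
parts = parts.apply(lambda col: col.str.strip())
print(parts[0].tolist())
```

Whether to trim is a judgment call: trimmed names are easier to match against literal strings later, but the column values then differ from those printed earlier in this notebook.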

## 4.7 customers_who_bought_this_item_also_bought

In [None]:
# see one of the rows
df['customers_who_bought_this_item_also_bought'][6]

'http://www.onlineretail.co/Hornby-R4388-RailRoad-Composite-Gauge/dp/B00260GEXO | http://www.onlineretail.co/Hornby-R1138-Passenger-Freight-Electric/dp/B006ZL6976'

**Note:** Details of the related products cannot be recovered from these links, so we only use the number of related products.

In [None]:
# Create a new column that shows the number of linked items
df['#_items_bought'] = df['customers_who_bought_this_item_also_bought'].str.count("http:")

In [None]:
# check one sample from new feature
print(df['#_items_bought'].sample(1))
# Check the missing values
print()
df['#_items_bought'].isna().sum()

503    6.0
Name: #_items_bought, dtype: float64



1062

In [None]:
# Assumption: If there are no product links, we assume that customers bought no other items. Therefore, we impute the missing values with 0.
df['#_items_bought'] = df['#_items_bought'].fillna(0)

## 4.8 description

In [None]:
# One sample from this feature
df['description'][1034]

'The evolution of Master Chief! The star of the Halo video game trilogy has evolved quite a bit over the past decade, and the Master Chief Evolution Action Figure 3-Pack covers his three best-known versions. Representing the first Halo game is Halo: Combat Evolved Master Chief. This figure includes Mark V armor, Assault Rifle, Frag Grenade, and a unique "low polygon count" design. Straight from the cover of the second game is Halo 2 Master Chief, with Mark VI armor, dual-wielded SMGs, and a Frag Grenade. Finishing the fight is Halo 3 Master Chief. He includes battle-damaged Mark VI armor, a Sniper Rifle, and a Frag Grenade. As an added bonus, both the Halo 2 and Halo 3 versions have been recreated at a larger size to fit perfectly with the Halo: Combat Evolved Master Chief and Halo: Reach figures (sold separately). Honor thy hero from the Halo franchise by ordering your 3-pack now!'

**Note:** We will create new features from product description: The length of description and count of numbers in description.

In [None]:
# Create a new feature - Length of description - # of words
df['desc_len'] = df['description'].apply(lambda x: len(str(x).split()))  

# Create a function that counts the numbers in a description
def find_number(x):
    num = len(re.findall(r'([0-9]+ )|([0-9]+[.:]+[0-9]+)|([0-9]+.)',x)) 
    return num

df['desc_#_numb'] = df['description'].dropna().apply(find_number)
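
One detail of `find_number` worth knowing: its last alternative, `[0-9]+.`, needs a character after the digits, so a single digit at the very end of a string is not counted. The following is an alternative sketch, not the notebook's method, that counts each run of digits (with optional `.` or `:` separated parts) as one number; `count_numbers` is a hypothetical name.

```python
import re

# One number = digits, optionally followed by more digit groups joined by '.' or ':'
NUMBER_PATTERN = re.compile(r'\d+(?:[.:]\d+)*')

def count_numbers(text):
    """Count numeric tokens such as '7', '12.5' or '10:30' in a string."""
    return len(NUMBER_PATTERN.findall(text))

sample = 'Includes 7 phrases, 12.5 cm tall, starts 10:30, Halo 3'
print(count_numbers(sample))
```

The two approaches can give different counts on the same text, so the features would not be directly comparable if you switched.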

**Note:** It is important to investigate extreme cases in the product description. Unusually long descriptions may point to messy or malformed text.

In [None]:
# Check the values of new column
print(f"Maximum of 'desc_#_numb' columns: {df['desc_#_numb'].max()}")
print(f"Maximum of 'desc_len' columns: {df['desc_len'].max()}")
print()
df['desc_len'].value_counts().sort_index()

Maximum of 'desc_#_numb' columns: 1712.0
Maximum of 'desc_len' columns: 6164



1       692
2       119
3        77
4       106
5       118
       ... 
715       1
718       1
824       1
939       1
6164      1
Name: desc_len, Length: 374, dtype: int64

In [None]:
# Get the index of the row which has an outlier in 'desc_len'
df[df['desc_len'] == 6164].index

Int64Index([2001], dtype='int64')

In [None]:
# Check the description of this outlier row
df['description'][2001]

'Product Description Manufacturer\'s Description Command your own fully functioning 12" Radio Control Dalek with authentic movement, 360 degree travel, flashing lights & speech. Includes 7 phrases such as - \'You will be exterminated\' & \'Intruder alert\', as well as \'blast\' sound effects. Also has automated head & eye movement with pose-able gun & arm. Includes \'Combat Mode\' rotating gun platform. Manufacturer\'s recommended age 5 years +. Features 360 degree travel 7 phrases, also blast sound 12" tall Flashing lights & speech Illuminated eye Automated Head & Eye movement Pose-able gun & arm Pose-able rotating centre section for \'combat mode\' Styled Control Unit 4 x AA Battery included, for \'Try me\' feature only Requires total of 1 x 9v for RC (not included) Also requires 4 x AA for Dalek (not included) Manufacturer\'s Description Command your own fully functioning 12" Radio Control Dalek with authentic movement, 360 degree travel, flashing lights & speech. Includes 7 phrases

In [None]:
# Clean the messy text for df['description'][2001]
# Keep only the text before the 'productDescription' marker

df.loc[2001, 'description'] = df['description'][2001].split('productDescription')[0]
 
# Update the length of this row
df.loc[2001, 'desc_len'] = len(str(df['description'][2001]).split())
df.loc[2001, 'desc_#_numb'] =  find_number(df['description'][2001])

In [None]:
# Check the values of new column
print(f"Maximum of 'desc_#_numb' columns: {df['desc_#_numb'].max()}")
print(f"Maximum of 'desc_len' columns: {df['desc_len'].max()}")

Maximum of 'desc_#_numb' columns: 123.0
Maximum of 'desc_len' columns: 939


## 4.9 product_description

In [None]:
# One sample from this feature
df['product_description'][100]



**Note:** The description and product_description columns seem similar. Let's check.

In [None]:
# Are these two columns identical?
print(f"Are the description and product_description columns identical?: {df['product_description'].equals(df['description'])}")


Are the description and product_description columns identical?: False


In [None]:
# Check the rows that differ between these two columns
df[df['description']!=df['product_description']][['description','product_description']]

Unnamed: 0,description,product_description
42,,
228,,
242,,
268,,
343,,
...,...,...
9978,,
9979,,
9989,,
9996,,


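**Note:** The rows listed above are those where both columns are missing. They are flagged as different because `NaN != NaN` evaluates to True.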

In [None]:
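# Are these two columns identical when missing values are excluded?
# Note: this filters the whole DataFrame by each column's non-missing mask,
# so True means both columns are missing in exactly the same rows
print('Are the description and product_description columns identical without NaNs?:')
df[df['product_description'].isna()==False].equals(df[df['description'].isna()==False])

Are the description and product_description columns identical without NaNs?: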


True

**Note:** The description and product_description columns appear to be identical, so the analysis continues with the 'description' column.
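
Because `NaN != NaN`, an elementwise `!=` flags rows where both columns are missing. A sketch of a comparison that ignores those shared gaps and keeps only rows whose content really differs, shown on toy Series (all values are made up):

```python
import numpy as np
import pandas as pd

a = pd.Series(['red car', np.nan, 'blue train'])
b = pd.Series(['red car', np.nan, 'blue trains'])

# Rows where both sides are missing should not count as a difference
both_missing = a.isna() & b.isna()
really_different = (a != b) & ~both_missing
print(really_different.tolist())
```

Applied to the two description columns, a mask like this would list any rows whose text differs, for instance a row that was edited by hand earlier in the notebook.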

## 4.10 product_information

**Note:** The product_information column typically includes more technical details related to the product (dimensions, size charts, washing instructions, etc.). Let's identify the technical details that recur across products.

### 4.10.1 *Clean Text in Product Information for Feature Engineering*

**Note:** We don't want to use wordninja to split adjacent words because it also splits floats. Adjacent words and numbers follow a pattern, so we will split them with regex.

* In some rows there is a text section about customer reviews. We will remove this messy part with the **clean_info()** function.
* Adjacent words will be split by the **add_space()** function.

In [None]:
# Keep a copy of 'product_information' column just in case.
df['information'] = df['product_information'].copy()

#  Replace NANs with '' because functions don't work on missing data 
df['product_information'] = df['product_information'].fillna('')

In [None]:
# Remove the messy text from "Customer Reviews" up to the next "stars"
def clean_info(x):
    x = re.sub('Customer.Reviews?(.*?)stars', '', x, flags=re.DOTALL)
    return x

df['product_information'] = df['product_information'].apply(clean_info)

In [None]:
# Add space after specific terms
def add_space(x):
    x = re.sub(r'(?<=Weight)(?=[\w])', r' ', x) 
    x = re.sub(r'(?<=:)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Dimensions)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Dimensions:)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=number)(?=[\w])', r' ', x) #can we combine this and next?
    x = re.sub(r'(?<=number:)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=[\S])(?=No|Yes)', r' ', x)
    x = re.sub(r'(?<=Type)(?=[\w])', r' ',  x)
    x = re.sub(r'(?<=s)(?=[\d])', r' ', x)
    x = re.sub(r'(?<=Language...)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Type...)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Color)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Suitability)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Available)(?=[\d])', r' ', x)
    x = re.sub(r'(?<=Composition)(?=[\d])', r' ', x)
    x = re.sub(r'(?<=Objective.s.)(?=[\w])', r' ', x)
    x = re.sub(r'(?<=Instructions)(?=[\w])',r' ', x)
    
    return x

df['product_information'] = df['product_information'].apply(lambda x: add_space(x)) 

# Replace no break space (\xa0)
df['product_information'] = df['product_information'].str.replace('\xa0', ' ')
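
Each rule in `add_space()` is a zero-width match: a lookbehind for a label and a lookahead for the next character, with a space inserted in between. Several labels can be combined into one pattern by alternating the lookbehinds. The sketch below covers only three labels, and the sample string is illustrative.

```python
import re

# Zero-width position: just after one of these labels, just before a word character
LABEL_BOUNDARY = re.compile(r'(?:(?<=Weight)|(?<=Dimensions)|(?<=Type))(?=\w)')

sample = 'Item Weight640 g Product Dimensions29.6 x 20.8 x 1 cm'
spaced = LABEL_BOUNDARY.sub(' ', sample)
print(spaced)
```

Python's `re` module requires each lookbehind to have a fixed width, which is why the labels are written as separate lookbehinds rather than one lookbehind containing an alternation of different lengths.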

In [None]:
# Create a new feature - Length of product information - # of words
df['info_len'] = df['product_information'].apply(lambda x: len(str(x).split()))  

# Use the find_number function that was created in section 4.8
df['info_#_numb'] = df['product_information'].apply(find_number)

### 4.10.2 *Get Most Common Words in Product Information*

In [None]:
# I want to remove the numbers before determining common words
# I keep clean version in a new column

def remove_number(x):
  x = re.sub(r'\d+', '', x)
  return x

df['info_clean'] = df['product_information'].apply(lambda x : remove_number(x))

In [None]:
# Instantiate a unigram and bigram vectorizer (ngram_range=(1, 2))
cv_bigram_vec = CountVectorizer(max_features=100, 
                                 stop_words='english', 
                                 ngram_range = (1,2))

# Fit and apply the vectorizer
cv_bigram = cv_bigram_vec.fit_transform(df['info_clean'])

# Print the unigram and bigram features
print(cv_bigram_vec.get_feature_names_out())

['accessories' 'additional' 'additional information' 'age' 'age years'
 'assembly' 'assembly required' 'available' 'batteries'
 'batteries included' 'batteries required' 'best' 'best sellers' 'cm'
 'cm manufacturer' 'date' 'date available' 'delivered' 'delivered date'
 'delivery' 'delivery destinations' 'destinations' 'destinations help'
 'destinations visit' 'details' 'details item' 'details manufacturer'
 'dimensions' 'dimensions cm' 'english' 'feedback' 'feedback images'
 'feedback like' 'figures' 'figures playsets' 'games' 'games figures'
 'games shipping' 'games toys' 'help' 'help page' 'images' 'included'
 'included additional' 'info' 'info feedback' 'information' 'item'
 'item delivered' 'item model' 'item weight' 'language' 'like'
 'like update' 'main' 'main language' 'manual' 'manufacturer'
 'manufacturer recommended' 'material' 'model' 'model number' 'number'
 'number puzzle' 'page' 'page item' 'pieces' 'playsets' 'product'
 'product dimensions' 'product info' 'puzzle' 'puzzl
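
To make the meaning of `ngram_range=(1,2)` concrete, here is a simplified pure-Python illustration that counts single words and adjacent word pairs. It is not scikit-learn's tokenizer (no stop-word removal, no `max_features` cut-off), and `count_ngrams` and the toy documents are hypothetical.

```python
from collections import Counter

def count_ngrams(texts, max_n=2):
    """Count all n-grams of length 1..max_n over whitespace-split tokens."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[' '.join(tokens[start:start + n])] += 1
    return counts

docs = ['item weight g', 'item model number']
counts = count_ngrams(docs)
print(counts.most_common(3))
```

This is why the feature list above mixes single words such as 'item' with pairs such as 'item weight', but contains no three-word phrases.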

### 4.10.3 *Create New Features from Product Information*

**Note:** After cleaning the product information column, I identified common technical details using both the count vectorizer and qualitative analysis. These parts of the text are common:

* [Item weight](#scrollTo=hBkurqWcgQab&line=1&uniqifier=1) (Float)
* [Product Dimensions](#scrollTo=QmhBCXO8gU34&line=1&uniqifier=1)
  * Volume (Float)
  * Dimension_1, Dimension_2, Dimension_3 (Float)
* [Manufacturer recommended age](#scrollTo=hSJ_rov9g2Os&line=1&uniqifier=1) (Int)
* [Item model number](#scrollTo=Un3rlQXdsWHL&line=1&uniqifier=1) (string)
* [Assembly Required](#scrollTo=GB79q3y3g88K&line=1&uniqifier=1) - (Y/N)
* [Radio Control Suitability](#scrollTo=ilcdJ2jXAyEb) (indoor/outdoor)
* [True-False details](#scrollTo=D3ZIeWHP2GT2&line=1&uniqifier=1)
  * Main Language 
  * Educational Objective
  * Material Care Instructions
  * Material Composition
  * Batteries
* [Number of Puzzle Pieces](#scrollTo=7a8nug4w6pNF&line=1&uniqifier=1) (Int)
* [Number of Players](#scrollTo=tkijJWpfE397&line=1&uniqifier=1) (String)
* [Batteries Required](#scrollTo=RoLgaW_qhBE1) (Y/N)
* [Batteries Included](#scrollTo=bEGAI-xKh-iC) (Y/N) 
* [Remote Control Included](#scrollTo=rQN8ha0xF3Tf&line=1&uniqifier=1) (Y/N) 
* [Color](#scrollTo=q0CnHyEQFGi2) (String)
* [Material Types](#scrollTo=t2LUrWCcHmU1) (String)
* [Additional Information (ASIN)](#scrollTo=snqKPslmiVVo) (String)
* [General Category](#scrollTo=b_fzPqHGiadf)
* [Best Sellers Rank - General Category](#scrollTo=b_fzPqHGiadf) (Int) 
* [Specific Category](#scrollTo=nIhZTnqXtgYt)
* [Best Sellers Rank - Specific Category](#scrollTo=rWjPceu1jkcf) (Int)
* [Shipping Weight](#scrollTo=uhRQ1XqJiIiB) (Float)
* [Delivery Destinations](#scrollTo=3L6tid4fnnTz) (String)
* [Date First Available](#scrollTo=QXTW0nDfm3_x) (String-Date)
* [Feedback](#scrollTo=gN8OqUcWilmh) (String)

#### *Item Weight*

In [None]:
# Create new columns for ITEM WEIGHT and its unit
df["weight"] = np.nan
df["weight_type"] = np.nan

# Catch the item weight and its unit for each row
weight_pattern = r'Item Weight ?((\S+ ){2})'
for i in range(len(df)):
    match = re.search(weight_pattern, df['product_information'][i])
    if match:  # rows without a match keep NaN
        value, unit = match.group(1).split(' ')[:2]
        df.loc[i, 'weight'] = value
        df.loc[i, 'weight_type'] = unit

In [None]:
# Convert the type of weight
df["weight"] = pd.to_numeric(df["weight"])

# Let's check the measurement unit of weight
df["weight_type"].value_counts()

g     7044
Kg     708
Name: weight_type, dtype: int64

In [None]:
# Convert kg to g
# Create a new column that shows all product weight in grams

df['weight_g'] = np.where(df['weight_type'] == 'Kg', (df['weight'])*1000, df['weight'])

In [None]:
df['weight_g'].dtypes

dtype('float64')

In [None]:
# Missing values
df['weight_g'].isna().sum()

2248
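
The weight extraction and the Kg-to-g conversion can also be expressed as one small function that is easy to test on single strings. This is a sketch assuming the only units are `g` and `Kg`, as in the counts above; `weight_in_grams` and the sample strings are illustrative.

```python
import re

WEIGHT_PATTERN = re.compile(r'Item Weight ?([\d.,]+) (g|Kg)\b')

def weight_in_grams(info):
    """Return the item weight in grams, or None when it cannot be parsed."""
    match = WEIGHT_PATTERN.search(info)
    if match is None:
        return None
    try:
        value = float(match.group(1).replace(',', ''))
    except ValueError:
        return None
    # Kilograms are converted to grams; grams are returned unchanged
    return value * 1000 if match.group(2) == 'Kg' else value

light = weight_in_grams('Technical Details Item Weight 640 g Product Dimensions 29.6 x 20.8 x 1 cm')
heavy = weight_in_grams('Item Weight 1.2 Kg')
print(light, heavy)
```

Such a function could be applied with `df['product_information'].apply(weight_in_grams)` instead of looping over row positions.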

#### *Product Dimension*

In [None]:
# Create a new column for Product Dimension
df["dimension"] = np.nan

# Catch the product dimensions for each product
for i in range(len(df)):
    match = re.search(r'Product Dimensions([^cm]+)', df['product_information'][i])
    if match:  # rows without a match keep NaN
        df.loc[i, 'dimension'] = match.group(1)

# Note: all dimensions were checked to be in cm

In [None]:
# I want to calculate the volume of the product
df_volume = df['dimension'].str.split('x' ,expand=True).replace(',', '', regex=True).astype(float)

# Create a new column that shows the volume of each product 
df['volume'] = df_volume[0]*df_volume[1]*df_volume[2]

In [None]:
# Create new features for each dimension
for i in range(0,3):
  df[f'dimension_{i+1}'] = df['dimension'].str.split('x' ,expand=True)[i].str.replace(',','').astype(float)
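
For a single dimension string, the split-and-multiply logic looks like the sketch below. It assumes three values separated by `x`, all in cm as noted above; `parse_dimensions` is a hypothetical helper.

```python
from math import prod

def parse_dimensions(dimension):
    """Split '29.6 x 20.8 x 1 ' into three floats; return None if that fails."""
    parts = [p.strip().replace(',', '') for p in dimension.split('x')]
    try:
        values = [float(p) for p in parts]
    except ValueError:
        return None
    return values if len(values) == 3 else None

dims = parse_dimensions(' 29.6 x 20.8 x 1 ')
volume = prod(dims)  # cubic centimetres under the cm assumption
print(dims, volume)
```

Returning None for strings with fewer or more than three parts makes incomplete dimension strings explicit rather than letting a missing factor silently propagate.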

#### *Manufacturer Recommended Age*

In [None]:
# Create a new column for Manufacturer Recommended Age
df["recom_age"] = np.nan

# Catch the Manufacturer Recommended Age for each product
for i in range(len(df)):
    match = re.search(r'Manufacturer recommended age: ((\w+ ){4})', df['product_information'][i])
    if match:  # rows without a match keep NaN
        df.loc[i, 'recom_age'] = match.group(1)


In [None]:
df['recom_age'].value_counts()

3 years and up       3703
14 years and up      1045
12 years and up       577
8 years and up        539
6 years and up        495
4 years and up        493
5 years and up        442
10 years and up       310
7 years and up        246
16 years and up       245
12 months and up       60
15 years and up        58
13 years and up        46
9 years and up         44
0 months and up        22
18 years and up        16
11 years and up         5
24 months and up        4
14 months and up        3
6 months and up         2
1 month and up          2
7 months and up         2
17 years and up         2
4 months and up         1
3 months and up         1
Name: recom_age, dtype: int64

In [None]:
# "x month(s) and up" will be accepted as 0 years and up
# Match 'month' so that the singular "1 month and up" is covered as well
df['recom_age']  = np.where(df['recom_age'].str.contains('month', na=False), '0 years and up', df['recom_age'])

# Use just ages. Remove 'years and up' 
df['recom_age'] = df['recom_age'].str.extract(r'(\d+)')

In [None]:
df['recom_age'].isna().sum()

1637
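
The age rule described above (any age given in months counts as 0 years) can be checked on individual strings with a small helper. This is a sketch; `recommended_age_years` is a hypothetical name and the inputs mirror the value counts shown earlier.

```python
import re

AGE_PATTERN = re.compile(r'(\d+) (months?|years?) and up')

def recommended_age_years(text):
    """Return the recommended age in whole years; ages in months map to 0."""
    match = AGE_PATTERN.search(text)
    if match is None:
        return None
    number, unit = match.groups()
    return 0 if unit.startswith('month') else int(number)

for label in ['3 years and up', '12 months and up', '1 month and up']:
    print(label, '->', recommended_age_years(label))
```

Mapping all month-based ages to 0 loses the distinction between, say, 6 and 24 months, which is acceptable only if infant age bands are not needed later.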

#### *Item Model Number*

In [None]:
# Create a new column for Item Model Number
df["model_no"] = np.nan

# Catch the item model number for each product
for i in range(len(df)):
    match = re.search(r'Item model number ((\w+ ))', df['product_information'][i])
    if match:  # rows without a match keep NaN
        df.loc[i, 'model_no'] = match.group(1)

In [None]:
df['model_no'].isna().sum()

4444

#### *Assembly Required*

In [None]:
# Create a new column for Assembly Required
df["assembly"] = np.nan

# Catch assembly required for each product
for i in range(len(df)):
    match = re.search(r'Assembly Required.?([^ >]+)', df['product_information'][i])
    if match:  # rows without a match keep NaN
        df.loc[i, 'assembly'] = match.group(1)

In [None]:
print("assembly values:")
print(df["assembly"].value_counts())

# Convert YES to 1 & NO to 0
df["assembly"] = df["assembly"].map({'Yes': 1, 'No': 0})


assembly values:
No     3574
Yes     625
Name: assembly, dtype: int64


#### *Radio Control Suitability*

In [None]:
# Create a new column for Radio Control Suitability
df["radio_control"] = np.nan

# Catch the radio control suitability for each product
for i in range(len(df)):
    match = re.search(r'Radio Control Suitability.?([^ >]+)', df['product_information'][i])
    if match:  # rows without a match keep NaN
        df.loc[i, 'radio_control'] = match.group(1)

In [None]:
# Values of "radio_control" column
print(df["radio_control"].value_counts())

# Convert all values to lower case

df["radio_control"] = df["radio_control"].str.lower()

indoor     312
Indoor     122
outdoor     30
Outdoor      6
Name: radio_control, dtype: int64


#### *True-False Details*

In [None]:
# Some rows list more than one language, so the languages themselves are not extracted.
# Each column below only flags whether the corresponding field appears in product_information.

df['main_lang'] = df['product_information'].str.contains('Main Language')
df['educational_obj'] = df['product_information'].str.contains('Educational Objective')
df['mat_care_inst'] = df['product_information'].str.contains('Material Care Instructions')
df['mat_compostion'] = df['product_information'].str.contains('Material Composition')
df['batteries'] = df['product_information'].str.contains('Batteries:')

In [None]:
df['educational_obj'].value_counts()

False    9666
True      334
Name: educational_obj, dtype: int64

In [None]:
df['batteries'].value_counts()

False    9735
True      265
Name: batteries, dtype: int64

#### *Number of Puzzle Pieces*

In [None]:
# Create a new column for Number of Puzzle Pieces
df["#_puzzle"] = np.nan

# catch the number of puzzle pieces for each product
for i in range(len(df)):
    try: 
        df.loc[i, '#_puzzle'] = re.search(r'Number of Puzzle Pieces ?([^ >]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, '#_puzzle'] = np.nan 

df["#_puzzle"] = pd.to_numeric(df["#_puzzle"] , errors='coerce')

In [None]:
df["#_puzzle"].value_counts().sort_index()

1.0        935
2.0         45
3.0         29
4.0         35
5.0         21
          ... 
4500.0       1
5000.0       1
7000.0       1
8600.0       1
24000.0      1
Name: #_puzzle, Length: 245, dtype: int64

#### *Number of Players*

In [None]:
# Create a new column for Number of Game Players
df["#_players"] = np.nan

# catch the number of game players for each product
for i in range(len(df)):
    try: 
        df.loc[i, '#_players'] = re.search(r'Number of Game Players ?([^ >]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, '#_players'] = np.nan 

In [None]:
df["#_players"].value_counts().sort_index()[:10]

1      353
1+      23
1-2      2
1-4      1
1-6     10
10       1
100      2
15       1
16       1
2      167
Name: #_players, dtype: int64
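
The raw values mix plain counts ("2"), open ranges ("1+") and closed ranges ("1-4"), so the column cannot be converted with `pd.to_numeric` directly. One possible normalization, shown only as a sketch, keeps the leading number as the minimum number of players. `min_players` is a hypothetical helper, not part of the notebook.

```python
import re

def min_players(value):
    """Return the smallest player count mentioned in a raw value, or None."""
    if not isinstance(value, str):
        return None
    match = re.match(r"\s*(\d+)", value)
    return int(match.group(1)) if match else None

raw = ["1", "1+", "1-4", "2", "10", None]
print([min_players(v) for v in raw])
```

Whether the minimum, the maximum, or both should be kept depends on how the feature is used later.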

#### *Batteries Required*

In [None]:
# Create a new column for Batteries Required
df["batt_requi"] = np.nan

# catch batteries required for each product
for i in range(len(df)):
    try: 
        df.loc[i, 'batt_requi'] = re.search(r'Batteries Required?..?([^ >]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'batt_requi'] = np.nan 

In [None]:
df['batt_requi'].value_counts()

No     4420
Yes     275
Name: batt_requi, dtype: int64

#### *Batteries Included*

In [None]:
# Create a new column for Batteries Included
df["batt_inc"] = np.nan

# catch batteries included for each product
for i in range(len(df)):
    try: 
        df.loc[i, 'batt_inc'] = re.search(r'Batteries Included..?([^ >]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'batt_inc'] = np.nan 

In [None]:
df['batt_inc'].value_counts()

No     4497
Yes     175
Name: batt_inc, dtype: int64

#### *Remote Control Included*

In [None]:
# Create a new column for Remote Control Included
df["remote_control"] = np.nan

# catch remote control included for each product
for i in range(len(df)):
    try: 
        df.loc[i, 'remote_control'] = re.search(r'Remote Control Included.? ([^ >]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'remote_control'] = np.nan 

In [None]:
df["remote_control"].value_counts()

No     654
Yes     34
Name: remote_control, dtype: int64

#### *Color*

In [None]:
# Create a new column for Color
df["color"] = np.nan

# catch the color for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'color'] = re.search(r'Color ?(\w+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'color'] = np.nan 

In [None]:
df["color"].value_counts()[:]

multicoloured    302
multi            117
beige            112
Black            103
Blue              92
                ... 
_                  1
BLUE               1
Sky                1
BLACK              1
Hangman            1
Name: color, Length: 64, dtype: int64

#### *Material Type*

In [None]:
# Create a new column for Material Type
df["material_type"] = np.nan

# catch the material type for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'material_type'] = re.search(r'Material Type...? (\w+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'material_type'] = np.nan 

In [None]:
df["material_type"].value_counts()[:20]

Plastic      533
Polyester     82
Cardboard     70
Wood          60
Plush         59
Metal         57
Jacquard      53
PVC           25
Felt          22
Nylon         21
plastic       20
TUBE          15
Satin         14
Paper         12
resine        12
cardboard     12
DieCast       11
Autre         10
wood          10
plastique     10
Name: material_type, dtype: int64
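
The counts above split the same material across spellings ("Plastic", "plastic", "plastique"). A minimal sketch of one way to merge them: lower-case every label and map known synonyms to a single key. The `SYNONYMS` table and `normalize_material` are assumptions for illustration, and only the one mapping visible in the output is included.

```python
from collections import Counter

# Hypothetical synonym table; extend it after inspecting the real values.
SYNONYMS = {"plastique": "plastic"}

def normalize_material(value):
    """Lower-case a material label and collapse known synonyms."""
    if not isinstance(value, str):
        return None
    key = value.strip().lower()
    return SYNONYMS.get(key, key)

raw = ["Plastic", "plastic", "plastique", "Wood", "wood", None]
counts = Counter(normalize_material(v) for v in raw if v is not None)
print(counts)
```

The same lower-casing step would also merge the color variants such as "Black" and "BLACK".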

#### *Additional Information -ASIN-*

In [None]:
# Create a new column for ADDITIONAL INFORMATION -ASIN-
df["asin"] = np.nan

# catch ASIN for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'asin'] = re.search(r'Additional Information?(\W*(\w+))', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'asin'] = np.nan 

In [None]:
df["asin"].isna().sum()

58

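**Note:** Only 58 rows lack an ASIN, which is consistent with the 58 rows (10000 - 9942) that have no `product_information`; every product with `product_information` appears to have an ASIN.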

#### *General Category (Best Sellers Rank)*

In [None]:
# Create a new column for General Category
df["main_cate"] = np.nan

#  catch best seller rank for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'main_cate'] = re.search(r'Best Sellers Rank ?[\S]+ in ((\S+ ){3})', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'main_cate'] = np.nan 
        

In [None]:
df["main_cate"].value_counts()

Toys & Games             9842
Kitchen & Home             14
Baby (See top               2
Car & Motorbike             2
Office Products (See        2
Garden & Outdoors           1
Grocery (See top            1
Computers (See top          1
DIY & Tools                 1
Name: main_cate, dtype: int64

In [None]:
df['product_information'].str.contains('Toys & Games').sum()

9852

In [None]:
# Let's correct four of the values (each extracted value ends with a single trailing space)
df["main_cate"] = df["main_cate"].replace(['Baby (See top ', 'Office Products (See ', 'Grocery (See top ', 'Computers (See top '],
                            ['Baby', 'Office Products', 'Grocery', 'Computers'])


#### *Best Sellers Rank -  General Category*

**Note:** The pattern for best sellers rank:
*   ... ASIN1234567890 Best Sellers Rank 123,456 in Category
*   ... ASIN1234567890 123,456 in Category
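
Both patterns can also be covered by one regular expression in which the "Best Sellers Rank" label is optional. The sketch below is an alternative to the two-pass approach, not the notebook's method. `RANK_RE` and `general_rank` are hypothetical names; the first sample string is taken from the product shown later in this section, and the second is made up to follow the unlabeled pattern.

```python
import re

# ASIN token, whitespace, optional label, then the rank followed by " in "
RANK_RE = re.compile(r"ASIN\S+\s+(?:Best Sellers Rank\s+)?([\d,]+) in ")

def general_rank(info):
    """Return the general Best Sellers Rank as an int, or None if absent."""
    match = RANK_RE.search(info)
    return int(match.group(1).replace(",", "")) if match else None

with_label = ("Additional Information ASINB003DS0Z9M "
              "Best Sellers Rank 495,957 in Toys & Games (See top 100)")
without_label = "Additional Information ASIN1234567890  123,456 in Toys & Games"
print(general_rank(with_label), general_rank(without_label))
```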

In [None]:
# Create a new column for BEST SELLERS RANK
df["best_seller_rank"] = np.nan

# catch the general best sellers rank (digits and thousands separators) for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'best_seller_rank'] = re.search(r'Best Sellers Rank\W*([\d,]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'best_seller_rank'] = np.nan 
        
# Replace , with '' and convert to a number
df['best_seller_rank'] = pd.to_numeric(df["best_seller_rank"].str.replace(',', ''), errors='coerce')

In [None]:
# Create a new column for BEST SELLERS RANK 2
df["best_seller_rank_2"] = np.nan

#  catch best seller rank for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'best_seller_rank_2'] = re.search(r'ASIN[\S]+  ([^ >]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'best_seller_rank_2'] = np.nan 
        
# Replace , with '' and convert to a number
df['best_seller_rank_2'] = pd.to_numeric(df["best_seller_rank_2"].str.replace(',', ''), errors='coerce')

In [None]:
# Fill Best Sellers Rank NaNs with Best Sellers Rank 2
df['best_seller_rank'] = df['best_seller_rank'].fillna(df['best_seller_rank_2'])
del df['best_seller_rank_2']

In [None]:
df['best_seller_rank'].isna().sum()

134

#### *Specific Category (Best Sellers Rank)*

In [None]:
# products that contain Toys & games
df['product_information'].str.contains('Toys & Games').describe()

count     10000
unique        2
top        True
freq       9852
Name: product_information, dtype: object

In [None]:
# products that have a specific rank:
df[df['product_information'].str.contains('Toys & Games >')==True].shape

(8118, 58)

In [None]:
# Products that do not mention Toys & Games
df[df['product_information'].str.contains('Toys & Games')==False]['category'].value_counts()

Hobbies                              31
Fancy Dress                          27
Die-Cast & Toy Vehicles              23
Sports Toys & Outdoor                11
Pretend Play                          7
Figures & Playsets                    4
Arts & Crafts                         4
Games                                 4
Bags                                  3
Party Supplies                        3
Characters & Brands                   2
Jigsaws & Puzzles                     2
Dolls & Accessories                   2
Storage, Cleaning & Ring Sizers       1
Car Parts                             1
Sweets, Chocolate & Gum               1
Puppets & Puppet Theatres             1
Dogs                                  1
Pens, Pencils & Writing Supplies      1
Baby & Toddler Toys                   1
Jams, Honey & Spreads                 1
Name: category, dtype: int64

In [None]:
# Toys that do not have specific rank:
df[df['product_information'].str.contains('Toys & Games >')==False]['category'].value_counts()

Die-Cast & Toy Vehicles              1018
Hobbies                               654
Characters & Brands                    91
Fancy Dress                            27
Party Supplies                         12
Sports Toys & Outdoor                  11
Pretend Play                            7
Arts & Crafts                           5
Bags                                    4
Games                                   4
Figures & Playsets                      4
Supporters' Gear                        2
Jigsaws & Puzzles                       2
Dolls & Accessories                     2
Lab & Scientific Products               1
Baby & Toddler Toys                     1
Pens, Pencils & Writing Supplies        1
Dogs                                    1
Sweets, Chocolate & Gum                 1
Puppets & Puppet Theatres               1
Medication & Remedies                   1
Sex & Sensuality                        1
Jams, Honey & Spreads                   1
Laundry, Storage & Organisation   

In [None]:
# checking a category that looks odd in a Toys & Games dataset:
df[df['category'].str.contains('Sex & Sensuality') == True]

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,category_and_sub_category,customers_who_bought_this_item_also_bought,...,#_puzzle,#_players,batt_requi,batt_inc,remote_control,color,material_type,asin,main_cate,best_seller_rank
3217,0e3667c6a909a6a779e1f5da2017fbd6,Smiffy's Gothic Manor Vampiress - Small,Smiffy's,13.51,7 new,2,1,4.0,Sex & Sensuality > Erotic Clothing > Women's >...,http://www.onlineretail.co/HALLOWEEN-SKELETON-...,...,,,No,No,,multi,,ASINB003DS0Z9M,Toys & Games,495957.0


In [None]:
df['product_description'][3217]

'Size Name:UK 8-10 Product Description Paint the town red,blood red with this Gothic Manor Vampire Costume. Your fellow partygoers will be queuing up to donate blood as you enjoy the party in this devilish black and red seductive fancy dress outfit. Comes complete with fishtail dress with wings,collar and gloves. Box Contains <LI>1x Dress<LI>1x Collar<LI>1x Pair of Gloves'

In [None]:
df['product_information'][3217]

'Technical Details Item Weight 390 g Product Dimensions 49 x 30 x 2.5 cm Manufacturer recommended age: 18 years and up Item model number 34490S Batteries Required? No Batteries Included? No Color multi-colored    Additional Information ASINB003DS0Z9M Best Sellers Rank 495,957 in Toys & Games (See top 100) Shipping Weight 540 g Delivery Destinations: Visit the Delivery Destinations Help page to see where this item can be delivered. Date First Available 1 Jan. 2008    Feedback  Would you like to update product info or give feedback on images?'

In [None]:
df['asin'].isna().sum()

58

We see that:
 * Of the 9942 products with `product_information`, 9852 mention "Toys & Games" in it.
 * The remaining 90 products (9942 - 9852) all belong to some sort of subcategory, which means the whole dataset can be considered `Toys & Games`, and this will be considered our general rank.
 * 1824 products (9942 - 8118) do not have a specific Best Sellers Rank; 1734 of them (9852 - 8118) still mention `Toys & Games` and so have a general rank.
 * Even products that apparently do not belong can be placed in the `Toys & Games` category.
 * The categories derived from `category_and_sub_category` in the dataset are not necessarily consistent with the categories extracted from `product_information`. We must investigate further.


In [None]:
# Extract the sub-category chain that follows "Toys & Games > " in the Best Sellers Rank text.
# Some products have 2 different specific categories. We will consider the first:
for i in range(len(df)):
    try: 
        cat = re.search(r'(?<=Toys & Games \> )(.*)(?= #)', df['product_information'][i]).group(1).split(' > ')
        for j in range(len(cat)):
          df.loc[i,f'pi_sub_category{j+1}'] = cat[j]
    except:
        df.loc[i, 'pi_sub_category1'] = np.nan


df['pi_sub_category1'].describe()

count              1420
unique               42
top       Arts & Crafts
freq                286
Name: pi_sub_category1, dtype: object

In [None]:
# 1420 done from total 8118
for i in df[df['pi_sub_category1'].isna()].index:
    try: 
        cat = re.search(r'(?<=Toys & Games \> )(.*)(?= Shipping Weight)', df['product_information'][i]).group(1).split(' > ')
        for j in range(len(cat)):
          df.loc[i,f'pi_sub_category{j+1}'] = cat[j]
    except:
        df.loc[i, 'pi_sub_category1'] = np.nan

df['pi_sub_category1'].describe()

count                   6975
unique                    42
top       Figures & Playsets
freq                    1618
Name: pi_sub_category1, dtype: object

In [None]:
# 6975 out of 8118 sub-categories found.
for i in df[df['pi_sub_category1'].isna()].index:
    try: 
        cat = re.search(r'(?<=Toys & Games \> )(.*)(?= Delivery Destinations:)', df['product_information'][i]).group(1).split(' > ')
        for j in range(len(cat)):
          df.loc[i,f'pi_sub_category{j+1}'] = cat[j]
    except:
        df.loc[i, 'pi_sub_category1'] = np.nan

df['pi_sub_category1'].describe()

count                   8118
unique                    42
top       Figures & Playsets
freq                    1663
Name: pi_sub_category1, dtype: object

In [None]:
# 8118 out of 8118 sub-categories found
# compare with existing categories:
df['pi_sub_category1'].value_counts()

Figures & Playsets                                     1663
Games                                                  1115
Arts & Crafts                                           871
Party Supplies                                          721
Dressing Up                                             580
Model Trains & Railway Sets                             429
Dolls & Accessories                                     409
Sports Toys & Outdoor                                   408
Soft Toys                                               395
Jigsaws & Puzzles                                       356
Puppets & Puppet Theatres                               279
Building & Construction Toys                            256
Motorised Model Building Kits                           222
Toddler Toys                                            135
Advent Calendars                                        116
Pretend Play                                             49
Bikes, Trikes & Ride-ons                

We now have up to 11 sub-categories for each product, not counting the parent category `Toys & Games`, but it seems we missed some patterns when generating the features. Let's investigate:

In [None]:

df[df['pi_sub_category1'] == "Building & Construction Toys #11176 in Toys & Games"]

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,category_and_sub_category,customers_who_bought_this_item_also_bought,...,main_cate,best_seller_rank,pi_sub_category1,pi_sub_category2,pi_sub_category3,pi_sub_category4,pi_sub_category5,pi_sub_category6,pi_sub_category7,pi_sub_category8
5673,6d15290223d3f6d3905a0af04a12b5ea,Return To Ravnica Booster Display,Return to Ravnica Booster Display,95.0,2 new,4,3,5.0,Hobbies > Trading Cards & Accessories > Packs ...,http://www.onlineretail.co/Magic-The-Gathering...,...,Toys & Games,155861.0,Building & Construction Toys #11176 in Toys & ...,Games,,,,,,


In [None]:
df.loc[5673,'product_information']

'Technical Details Item Weight 939 g Product Dimensions 12.5 x 20.3 x 6.9 cm Item model number WTC488860000A Main Language(s) English Number of Game Players 2 + Assembly Required No Batteries Required? No Batteries Included? No    Additional Information ASINB008MW1ULQ Best Sellers Rank 155,861 in Toys & Games (See top 100) #5739 in Toys & Games > Building & Construction Toys #11176 in Toys & Games > Games #17349 in Toys & Games > Figures & Playsets Shipping Weight 1.1 Kg Delivery Destinations: Visit the Delivery Destinations Help page to see where this item can be delivered. Date First Available 23 Aug. 2012    Feedback  Would you like to update product info or give feedback on images?'

In [None]:
# discard the alternate ranks and keep only the first:
for i in range(len(df)):
  for j in range(1,12):
    try:
      df.at[i,f'pi_sub_category{j}'] = df.loc[i,f'pi_sub_category{j}'].split(' #')[0]
    except:
      df.at[i,f'pi_sub_category{j}'] = np.nan

In [None]:
df['pi_sub_category1'].value_counts()

Figures & Playsets               1663
Games                            1115
Arts & Crafts                     871
Party Supplies                    721
Dressing Up                       580
Model Trains & Railway Sets       429
Dolls & Accessories               411
Sports Toys & Outdoor             408
Soft Toys                         395
Jigsaws & Puzzles                 358
Puppets & Puppet Theatres         279
Building & Construction Toys      259
Motorised Model Building Kits     222
Toddler Toys                      136
Advent Calendars                  116
Pretend Play                       49
Bikes, Trikes & Ride-ons           36
Musical Toy Instruments            24
Electronic Toys                    17
Magic & Practical Jokes            14
Education & Science                10
Keyrings                            2
Money Banks                         2
Mobile Phone Charms                 1
Name: pi_sub_category1, dtype: int64
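
The three extraction passes and the clean-up step can be collapsed into a single lazy match that stops at the first terminator. The sketch below is an alternative under stated assumptions, not the notebook's method: `SPECIFIC_RE` and `first_specific_rank` are hypothetical names, the terminators are the ones used in the cells above, and the sample text is shortened from product 5673.

```python
import re

# Lazy match of the first "#rank in Toys & Games > A > B" chain, ending at the
# next "#rank", at "Shipping Weight", at "Delivery Destinations:" or at the end.
SPECIFIC_RE = re.compile(
    r"#([\d,]+) in Toys & Games > (.*?)"
    r"(?= #| Shipping Weight| Delivery Destinations:|$)"
)

def first_specific_rank(info):
    """Return (rank, [sub-categories]) for the first specific rank, or None."""
    match = SPECIFIC_RE.search(info)
    if match is None:
        return None
    rank = int(match.group(1).replace(",", ""))
    return rank, match.group(2).split(" > ")

info = (
    "Best Sellers Rank 155,861 in Toys & Games (See top 100) "
    "#5739 in Toys & Games > Building & Construction Toys "
    "#11176 in Toys & Games > Games "
    "#17349 in Toys & Games > Figures & Playsets "
    "Shipping Weight 1.1 Kg Delivery Destinations: Visit the Delivery Destinations Help page"
)
print(first_specific_rank(info))
```

Because the rank and the category chain come from the same match, they cannot drift apart the way two independent regular expressions can.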

#### *Best Sellers Rank -  Specific Category*

In [None]:
# Create a new column for the rank in the specific category
df["seller_rank"] = np.nan

# catch the specific-category rank for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'seller_rank'] = re.search(r'See top 100\)\W*([\d,]+)', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'seller_rank'] = np.nan 
        
# Replace , with '' and convert to a number
df['seller_rank'] = pd.to_numeric(df["seller_rank"].str.replace(',', ''), errors='coerce')

#### *Shipping Weight*

In [None]:
# Create a new column for SHIPPING WEIGHT 
df["ship_weight"] = np.nan
df["ship_weight_type"] = np.nan

# catch item weight for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'ship_weight'] = re.search(r'Shipping Weight ?((\S+ ){2})', df['product_information'][i]).group(1).split(' ')[0]
        df.loc[i, 'ship_weight_type'] = re.search(r'Shipping Weight ?((\S+ ){2})', df['product_information'][i]).group(1).split(' ')[1]
    except:
        df.loc[i, 'ship_weight_type'] = np.nan 
        df.loc[i, 'ship_weight'] = np.nan 

In [None]:
df['ship_weight_type'].value_counts()

g     7468
Kg     899
Name: ship_weight_type, dtype: int64

In [None]:
# Convert Kg to g
# Create a new column that holds every shipping weight in grams

df['ship_weight'] = pd.to_numeric(df['ship_weight'], errors='coerce')
df['ship_weight_g'] = np.where(df['ship_weight_type'] == 'Kg', df['ship_weight'] * 1000, df['ship_weight'])
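
The conversion rule itself is small: parse the number, multiply by 1000 when the unit is "Kg", and keep it unchanged when the unit is "g". A sketch as a standalone helper (`to_grams` is a hypothetical name, and only the two units seen in the output above are handled):

```python
def to_grams(value, unit):
    """Convert a (value, unit) pair such as ('1.1', 'Kg') to grams."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None  # missing or non-numeric weight
    if unit == "Kg":
        return number * 1000
    if unit == "g":
        return number
    return None  # unknown or missing unit

print(to_grams("640", "g"), to_grams("1.1", "Kg"), to_grams(None, None))
```

Parsing the number before multiplying matters: multiplying a string by 1000 in Python repeats the string instead of scaling the value.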

#### *Delivery Destinations*

In [None]:
# Create a new column for Delivery Destinations
df["deli_dest"] = np.nan

# catch delivery destinations for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'deli_dest'] = re.search(r'Visit the Delivery Destinations(.*)Date', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'deli_dest'] = np.nan 

#### *Date First Available*

In [None]:
# Create a new column for Date First Available
df["first_del_date"] = np.nan

# catch the date first available for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'first_del_date'] = re.search(r'Date First Available(.*)Feedback', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'first_del_date'] = np.nan 
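
The extracted text keeps surrounding spaces and mixes abbreviated and full month names ("24 Dec. 2013", "25 July 2013"). A minimal sketch of turning it into a real date, using a hand-written month table so the result does not depend on the system locale; `MONTHS` and `parse_first_available` are hypothetical names, not part of the notebook.

```python
import re
from datetime import date

# Month lookup keyed by the first three letters, lower-cased
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def parse_first_available(text):
    """Parse strings like ' 24 Dec. 2013 ' or '25 July 2013' into a date."""
    if not isinstance(text, str):
        return None
    match = re.search(r"(\d{1,2}) ([A-Za-z]+)\.? (\d{4})", text)
    if match is None:
        return None
    month = MONTHS.get(match.group(2)[:3].lower())
    if month is None:
        return None
    return date(int(match.group(3)), month, int(match.group(1)))

print(parse_first_available(" 24 Dec. 2013    "))
print(parse_first_available("25 July 2013"))
```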

#### *Feedback*

In [None]:
# Create a new column for FEEDBACK
df["feedback"] = np.nan

# catch feedback for each row
for i in range(len(df)):
    try: 
        df.loc[i, 'feedback'] = re.search(r'Feedback .(.*)$', df['product_information'][i]).group(1)
    except:
        df.loc[i, 'feedback'] = np.nan 

In [None]:
df['feedback'].value_counts()

Would you like to update product info or give feedback on images?    9286
Name: feedback, dtype: int64

#### *Count of Technical Details*

In [None]:
# How many technical details does each product have?

technical_details = ['weight','dimension', 'recom_age',
       'model_no', 'assembly', 'radio_control', 'main_lang', 'educational_obj',
       'mat_care_inst', 'mat_compostion', 'batteries', '#_puzzle', '#_players',
       'batt_requi', 'batt_inc', 'remote_control', 'color', 'material_type',
       'asin', 'main_cate', 'best_seller_rank', 'seller_rank', 'ship_weight', 
       'deli_dest', 'first_del_date', 'feedback']

print(f"The total of technical_details is {len(technical_details)}.")

# create a new feature that shows the count of technical details of a product information
df['#_tech_details'] = (len(technical_details)) - (df[technical_details].isna().sum(axis=1))

The total of technical_details is 26.
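
Subtracting the number of missing values from the number of columns is the same as counting the non-missing values directly with `notna()`. A small sketch on a toy frame (the values are made up) shows the equivalence. Note that the True/False flag columns such as `main_lang` are never missing, so they add to the count even when they are `False`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", np.nan, "blue"],
    "asin": ["A1", "A2", np.nan],
    "assembly": [1.0, np.nan, np.nan],
})
cols = ["color", "asin", "assembly"]

# Number of non-missing details per row
toy["#_tech_details"] = toy[cols].notna().sum(axis=1)

# Same result as the "total minus missing" formulation
via_isna = len(cols) - toy[cols].isna().sum(axis=1)
print(toy["#_tech_details"].tolist(), via_isna.tolist())
```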


## 4.11 items_customers_buy_after_viewing_this_item

In [None]:
# see one of the rows
df['items_customers_buy_after_viewing_this_item'][9]

'http://www.onlineretail.co/Chuggington | http://www.onlineretail.co/Learning-Chuggington | http://www.onlineretail.co/Chuggington | http://www.onlineretail.co/Learning-Curve-Chuggington'

In [None]:
# Create a new column that shows the number of items
df['#_items_bought_view'] = df['items_customers_buy_after_viewing_this_item'].str.count("http:")

In [None]:
# check one sample from new feature
print(df['#_items_bought_view'].sample(1))
# Check the missing values
print()
df['#_items_bought_view'].isna().sum()

3581   NaN
Name: #_items_bought_view, dtype: float64



3065

In [None]:
# Assumption: if a product has no links, we assume that nobody bought anything after viewing it. Therefore, we impute the missing values with 0.
df['#_items_bought_view'] = df['#_items_bought_view'].fillna(0)
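
Counting occurrences of "http:" counts every link, including repeats; the sample row above lists the same URL twice. A sketch of splitting the cell on its ` | ` separator makes both the total and the distinct count available. `split_links` is a hypothetical helper, and the cell is the one printed above.

```python
def split_links(cell, sep=" | "):
    """Split a ' | '-separated cell into its URLs; missing cells give []."""
    if not isinstance(cell, str):
        return []
    return [part.strip() for part in cell.split(sep)
            if part.strip().startswith("http")]

cell = ("http://www.onlineretail.co/Chuggington | "
        "http://www.onlineretail.co/Learning-Chuggington | "
        "http://www.onlineretail.co/Chuggington | "
        "http://www.onlineretail.co/Learning-Curve-Chuggington")

links = split_links(cell)
print(len(links), len(set(links)), len(split_links(None)))
```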

## 4.12 customer_questions_and_answers

This column will be handled in "textual preprocessing" notebook.

## 4.13 customer_reviews

In [None]:
df['customer_reviews'][0]

"Worth Buying For The Pictures Alone (As Ever) // 4.0 // 6 April 2014 // By\n    \n    Copnovelist\n  \n on 6 April 2014 // Part of the magic for me growing up as a boy was to buy (or be given) the new Hornby catalogue every year, even if it included 90% of the same products as the previous year.  I've still got my old ones dating back to the 70s and 80s somewhere.  These days the catalogue is especially informative in that it tells you the vintage of the rolling stock which is useful if you are dedicating your railway to one particular era and train company. | Amazing detail fabulous photography. // 5.0 // 11 April 2015 // By\n    \n    richard\n  \n on 11 April 2015 // Amazing detail, every credit to the photographer in this book, a worthy reference manual, as well as a sales brochure. even if you only have a passing interest in the hobby you will be transported to another time when we were all younger and in awe of the big trains. | 'Great Purchase' // 5.0 // 23 April 2014 // By\n  

**Note:** The text and ratings of customer reviews are outside the scope of this project, so this column won't be included in the analysis.

## 4.14 sellers

In [None]:
df['sellers'][0]

'{"seller"=>[{"Seller_name_1"=>"OnlineRetail.co.uk", "Seller_price_1"=>"£3.42"}, {"Seller_name_2"=>"**stop-&-shop-uk**", "Seller_price_2"=>"£0.19"}, {"Seller_name_3"=>"World Wide Shopping Mall Ltd", "Seller_price_3"=>"£9.99"}, {"Seller_name_4"=>"MyHobbyStore Retail", "Seller_price_4"=>"£8.00"}, {"Seller_name_5"=>"francejouet", "Seller_price_5"=>"£37.62"}]}'

**Note:** The sellers column holds information about the different sellers of each product. It will be analyzed in the "competitors_data" notebook.
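
For reference, the `=>` separators make the raw string look like a Ruby hash; replacing them with `:` yields valid JSON, which is the same trick `CustomParser` uses at load time. The sketch below is not the analysis from the other notebook: `parse_sellers` is a hypothetical helper, the sample is shortened from the row printed above, and the handling of a single seller stored as a bare dict is an assumption.

```python
import json

def parse_sellers(raw):
    """Turn the hash-like string into a list of (name, price) tuples."""
    if not isinstance(raw, str) or not raw:
        return []
    data = json.loads(raw.replace("=>", ":"))
    entries = data.get("seller", [])
    if isinstance(entries, dict):  # assumption: one seller may not be wrapped in a list
        entries = [entries]
    sellers = []
    for entry in entries:
        name = next((v for k, v in entry.items() if k.startswith("Seller_name")), None)
        price = next((v for k, v in entry.items() if k.startswith("Seller_price")), None)
        amount = float(price.lstrip("£").replace(",", "")) if price else None
        sellers.append((name, amount))
    return sellers

raw = ('{"seller"=>[{"Seller_name_1"=>"OnlineRetail.co.uk", "Seller_price_1"=>"£3.42"}, '
       '{"Seller_name_2"=>"**stop-&-shop-uk**", "Seller_price_2"=>"£0.19"}]}')
sellers = parse_sellers(raw)
print(len(sellers), min(price for _, price in sellers))
```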

# 5. Download Clean DF as CSV

In [None]:
session_info.show()

In [None]:
df.shape

(10000, 78)

In [None]:
df.to_csv('df_after_cleaning.csv')

In [None]:
df.head()

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,category_and_sub_category,customers_who_bought_this_item_also_bought,...,pi_sub_category11,seller_rank,ship_weight,ship_weight_type,ship_weight_g,deli_dest,first_del_date,feedback,#_tech_details,#_items_bought_view
0,eac7efa5dbd3d667f26eb3d3ab504464,Hornby 2014 Catalogue,Hornby,3.42,5 new,15,1,4.9,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.onlineretail.co/Hornby-R8150-Catalo...,...,,69.0,640.0,g,640.0,Help page to see where this item can be deliv...,24 Dec. 2013,Would you like to update product info or give ...,26,4.0
1,b17540ef7e86e461d37f3ae58b7b72ac,FunkyBuys® Large Christmas Holiday Express Fes...,FunkyBuys,16.99,,2,1,4.5,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.onlineretail.co/Christmas-Holiday-E...,...,,261.0,,,,Help page to see where this item can be deliv...,,,11,4.0
2,348f344247b0c1a935b1223072ef9d8a,CLASSIC TOY TRAIN SET TRACK CARRIAGES LIGHT EN...,ccf,9.99,2 new,17,2,3.9,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.onlineretail.co/Classic-Train-Light...,...,,72.0,,,,Help page to see where this item can be deliv...,25 July 2013,Would you like to update product info or give ...,13,4.0
3,e12b92dbb8eaee78b22965d2a9bbbd9f,HORNBY Coach R4410A BR Hawksworth Corridor 3rd,Hornby,39.99,,1,2,5.0,Hobbies > Model Trains & Railway Sets > Rail V...,,...,,1480.0,259.0,g,259.0,Help page to see where this item can be deliv...,3 Mar. 2011,Would you like to update product info or give ...,16,0.0
4,e33a9adeed5f36840ccc227db4682a36,Hornby 00 Gauge 0-4-0 Gildenlow Salt Co. Steam...,Hornby,32.19,,3,2,4.7,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.onlineretail.co/Hornby-R6367-RailRo...,...,,108.0,159.0,g,159.0,Help page to see where this item can be deliv...,12 Jun. 2013,Would you like to update product info or give ...,18,4.0


In [None]:
# This code creates a .csv file. To see the data, open MENU -> Files on the left.
df.to_csv(r'/content/drive/MyDrive/walmart c1 bootcamp/df_after_cleaning.csv',
          index=False)