### **Business Context**

- Customer segmentation is one of the most important marketing tools available, because grouping customers by common characteristics helps a business better understand its target audience.
- Segmentation can be based on customers' habits and lifestyle, in particular their buying habits. Different age groups, for example, tend to spend their money in different ways, so brands need to know exactly who is buying their products.
- Segmentation can also focus on the personality of the consumer, including their opinions, interests, reviews, and ratings.
- Breaking a large customer base down into more manageable clusters makes it easier to identify the target audience, launch campaigns, and promote the business to the most relevant people.



### **Project Objective**

- Based on the given user and item data from an e\-commerce company, segment similar users and items into suitable clusters. Analyze the clusters and provide insights to help the organization promote its business.



### **Data Description**

- The dataset contains measurements of clothing fit from RentTheRunway, a platform that allows women to rent clothes for various occasions. The data were collected from several categories and contain self\-reported fit feedback from customers as well as side information such as reviews, ratings, product categories, catalog sizes, and customers' measurements.



### **Attribute Information:**

- **user\_id**: a unique id for the customer
- **item\_id**: unique product id
- **weight**: weight measurement of customer
- **rented for**: purpose clothing was rented for
- **body type**: body type of customer
- **review\_text**: review given by the customer
- **review\_summary**: summary of the review
- **size**: the standardized size of the product
- **rating**: rating for the product
- **age**: age of the customer
- **category**: the category of the product
- **bust size**: bust measurement of customer
- **height**: height of the customer
- **review\_date**: date when the review was written
- **fit**: fit feedback



#### 1. Load the required libraries and read the dataset.



In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from statsmodels.stats.diagnostic import normal_ad
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
from scipy.special import inv_boxcox

In [4]:
# Loading the dataset
df = pd.read_csv('renttherunway.csv')

#### 2. Check the first few samples, shape, info of the data and try to familiarize yourself with different features



In [5]:
# Initially reading dataset
df.head(5)

Unnamed: 0.1,Unnamed: 0,fit,user_id,bust size,item_id,weight,rating,rented for,review_text,body type,review_summary,category,height,size,age,review_date
0,0,fit,420272,34d,2260466,137lbs,10.0,vacation,An adorable romper! Belt and zipper were a lit...,hourglass,So many compliments!,romper,"5' 8""",14,28.0,"April 20, 2016"
1,1,fit,273551,34b,153475,132lbs,10.0,other,I rented this dress for a photo shoot. The the...,straight & narrow,I felt so glamourous!!!,gown,"5' 6""",12,36.0,"June 18, 2013"
2,2,fit,360448,,1063761,,10.0,party,This hugged in all the right places! It was a ...,,It was a great time to celebrate the (almost) ...,sheath,"5' 4""",4,116.0,"December 14, 2015"
3,3,fit,909926,34c,126335,135lbs,8.0,formal affair,I rented this for my company's black tie award...,pear,Dress arrived on time and in perfect condition.,dress,"5' 5""",8,34.0,"February 12, 2014"
4,4,fit,151944,34b,616682,145lbs,10.0,wedding,I have always been petite in my upper body and...,athletic,Was in love with this dress !!!,gown,"5' 9""",12,27.0,"September 26, 2016"


In [6]:
# Dropping serial no. column
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [7]:
# Shape of dataset
df.shape

(192544, 15)

In [8]:
# Renaming columns to improve ease-of-accessibility for pandas:

df.columns = [
    'fit', 'user_id', 'bust_size', 'item_id', 'weight', 'rating',
    'rented_for', 'review_text', 'body_type', 'review_summary', 'category',
    'height', 'size', 'age', 'review_date'
]
df.head(5)

Unnamed: 0,fit,user_id,bust_size,item_id,weight,rating,rented_for,review_text,body_type,review_summary,category,height,size,age,review_date
0,fit,420272,34d,2260466,137lbs,10.0,vacation,An adorable romper! Belt and zipper were a lit...,hourglass,So many compliments!,romper,"5' 8""",14,28.0,"April 20, 2016"
1,fit,273551,34b,153475,132lbs,10.0,other,I rented this dress for a photo shoot. The the...,straight & narrow,I felt so glamourous!!!,gown,"5' 6""",12,36.0,"June 18, 2013"
2,fit,360448,,1063761,,10.0,party,This hugged in all the right places! It was a ...,,It was a great time to celebrate the (almost) ...,sheath,"5' 4""",4,116.0,"December 14, 2015"
3,fit,909926,34c,126335,135lbs,8.0,formal affair,I rented this for my company's black tie award...,pear,Dress arrived on time and in perfect condition.,dress,"5' 5""",8,34.0,"February 12, 2014"
4,fit,151944,34b,616682,145lbs,10.0,wedding,I have always been petite in my upper body and...,athletic,Was in love with this dress !!!,gown,"5' 9""",12,27.0,"September 26, 2016"


In [9]:
# Basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192544 entries, 0 to 192543
Data columns (total 15 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   fit             192544 non-null  object 
 1   user_id         192544 non-null  int64  
 2   bust_size       174133 non-null  object 
 3   item_id         192544 non-null  int64  
 4   weight          162562 non-null  object 
 5   rating          192462 non-null  float64
 6   rented_for      192534 non-null  object 
 7   review_text     192476 non-null  object 
 8   body_type       177907 non-null  object 
 9   review_summary  192197 non-null  object 
 10  category        192544 non-null  object 
 11  height          191867 non-null  object 
 12  size            192544 non-null  int64  
 13  age             191584 non-null  float64
 14  review_date     192544 non-null  object 
dtypes: float64(2), int64(3), object(10)
memory usage: 22.0+ MB


By looking at the data we can make some initial observations:

- The dataframe contains 192544 entries and 15 columns.

- The dataframe contains multiple missing values across multiple columns which need to be handled.

- The dataframe contains several categorical columns, such as 'category', 'rented\_for' and 'body\_type', which need to be handled.

- The 'height' column contains a string representation which needs to be converted to numeric.

- The 'weight' column contains measuring unit 'lbs' which needs to be dropped.

- Column names containing spaces have been renamed with underscores so that they are easier to access in pandas.



#### 3. Check whether there are any duplicate records in the dataset. If there are, drop them.



In [10]:
# Checking for duplicate data:

df[df.duplicated(keep=False)]

Unnamed: 0,fit,user_id,bust_size,item_id,weight,rating,rented_for,review_text,body_type,review_summary,category,height,size,age,review_date
483,fit,61928,34c,1384766,135lbs,10.0,party,This dress runs very tight in the waist. Also...,pear,I rented this dress for a black & white party....,sheath,"5' 4""",12,34.0,"September 20, 2016"
639,fit,61928,34c,1384766,135lbs,10.0,party,This dress runs very tight in the waist. Also...,pear,I rented this dress for a black & white party....,sheath,"5' 4""",12,34.0,"September 20, 2016"
705,fit,952829,36d,1522253,165lbs,8.0,other,You can dress this up or down. Great for vaca...,pear,tons of compliments. Very nice dress,dress,"5' 6""",20,42.0,"April 9, 2015"
1146,fit,188164,36d,1707988,132lbs,10.0,other,The colors of this dress are absolutely beauti...,hourglass,Felt like a Runway Model!,dress,"5' 2""",16,53.0,"August 9, 2017"
1967,fit,491875,,1707988,,10.0,party,"Comfortable, classy, and unique. A great find.",,Gorgeous dress,dress,"5' 4""",8,31.0,"July 24, 2017"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188553,fit,213210,34d,1707988,,10.0,wedding,I got so many compliments! The color is so vib...,athletic,Beautiful and Wearable!,dress,"5' 7""",20,28.0,"August 7, 2017"
189032,small,994049,32a,1384766,128lbs,4.0,wedding,"I got my usual size, a 2, and it mostly fit ex...",athletic,Good for the Barbie figures. Odd fit for the r...,sheath,"5' 4""",4,38.0,"October 5, 2017"
189895,fit,932177,36b,1459957,150lbs,10.0,formal affair,I was worried about the length but luckily it ...,pear,Fun dress,dress,"5' 6""",20,53.0,"February 15, 2017"
189970,fit,204984,34b,1522253,119lbs,8.0,everyday,"Dress is great and super comfy, but it runs su...",hourglass,Wore this for my fiance's dirty 30 bday during...,dress,"5' 6""",8,35.0,"June 9, 2015"


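About 189 records have at least one exact copy. Because `keep=False` flags every copy of a duplicated row, the listing above includes all copies (378 rows in total, the difference between the 192544 rows loaded and the 192166 rows that remain after removal).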



In [11]:
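# Removing duplicate records.
# Note: keep=False drops every copy of a duplicated row, so no representative
# is retained; keep='first' would keep one copy of each duplicated record.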

df.drop_duplicates(keep=False, inplace=True)
df[df.duplicated(keep=False)]

Unnamed: 0,fit,user_id,bust_size,item_id,weight,rating,rented_for,review_text,body_type,review_summary,category,height,size,age,review_date


No duplicate records remain in the dataframe.
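
The effect of the `keep` argument can be checked on a small example. The sketch below uses a hypothetical toy frame (not the RentTheRunway data) to compare `keep=False`, which removes every copy of a duplicated row, with `keep='first'`, which retains one copy of each.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact copies of each other.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "rating": [10.0, 10.0, 8.0, 6.0],
})

# keep=False marks every copy as a duplicate, so both copies are removed.
dropped_all = toy.drop_duplicates(keep=False)

# keep="first" retains one representative of each duplicated record.
kept_first = toy.drop_duplicates(keep="first")

print(len(toy), len(dropped_all), len(kept_first))
```

Which variant is appropriate depends on whether the duplicated record itself should stay in the analysis; for a plain de-duplication, `keep='first'` is the more common choice.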


#### 4. Drop the columns that you think are redundant for the analysis. \(Hint: drop columns like 'id' and 'review'.\)



In [12]:
# Dropping columns that are redundant for the analysis

df.drop(['user_id', 'item_id','review_text','review_summary','review_date'], axis=1, inplace=True)
df.head(5)

Unnamed: 0,fit,bust_size,weight,rating,rented_for,body_type,category,height,size,age
0,fit,34d,137lbs,10.0,vacation,hourglass,romper,"5' 8""",14,28.0
1,fit,34b,132lbs,10.0,other,straight & narrow,gown,"5' 6""",12,36.0
2,fit,,,10.0,party,,sheath,"5' 4""",4,116.0
3,fit,34c,135lbs,8.0,formal affair,pear,dress,"5' 5""",8,34.0
4,fit,34b,145lbs,10.0,wedding,athletic,gown,"5' 9""",12,27.0


#### 5. Check the column 'weight'. Is there any string data present? If yes, remove it and convert the column to float. \(Hint: 'weight' has the suffix 'lbs'.\)



In [13]:
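# Stripping the 'lbs' suffix from column 'weight' and converting it to float data type
# (missing values become the string 'nan', which astype(float) turns back into NaN)

df['weight'] = df['weight'].astype(str).str.replace('lbs', '', regex=False)
df['weight'] = df['weight'].astype(float)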

df.head(5)

Unnamed: 0,fit,bust_size,weight,rating,rented_for,body_type,category,height,size,age
0,fit,34d,137.0,10.0,vacation,hourglass,romper,"5' 8""",14,28.0
1,fit,34b,132.0,10.0,other,straight & narrow,gown,"5' 6""",12,36.0
2,fit,,,10.0,party,,sheath,"5' 4""",4,116.0
3,fit,34c,135.0,8.0,formal affair,pear,dress,"5' 5""",8,34.0
4,fit,34b,145.0,10.0,wedding,athletic,gown,"5' 9""",12,27.0
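
As a minimal sketch of the same conversion on a hypothetical toy Series: the `.str` accessor passes missing values through unchanged, so the suffix can be stripped without first casting to `str`, and `pd.to_numeric` then yields floats with `NaN` for the gaps.

```python
import pandas as pd

# Hypothetical sample values in the same "<number>lbs" format, with one gap.
raw = pd.Series(["137lbs", "132lbs", None, "145lbs"])

# Strip the literal suffix (regex=False treats 'lbs' as plain text),
# then convert the remaining digits to floats; the missing entry stays NaN.
weight = pd.to_numeric(raw.str.replace("lbs", "", regex=False))

print(weight.tolist())
```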


#### 6. Check the unique categories for the column 'rented for' and group 'party: cocktail' category with the 'party'.



In [14]:
#Checking unique categories for column 'rented_for'

df['rented_for'].unique()

array(['vacation', 'other', 'party', 'formal affair', 'wedding', 'date',
       'everyday', 'work', nan, 'party: cocktail'], dtype=object)

In [15]:
# Combining 'party' and 'party: cocktail' categories; all other values are left unchanged

df['rented_for'] = df['rented_for'].replace({'party: cocktail': 'party'})

df['rented_for'].unique()

array(['vacation', 'other', 'party', 'formal affair', 'wedding', 'date',
       'everyday', 'work', nan], dtype=object)
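
Merging one label into another can be sketched on a hypothetical toy Series. The example contrasts `Series.replace`, which only touches the listed keys, with `Series.map`, which turns every value missing from the dictionary into `NaN` and therefore needs a complete mapping.

```python
import pandas as pd

# Hypothetical sample of rental purposes, including one missing value.
rented_for = pd.Series(["party", "party: cocktail", "wedding", None])

# replace: only 'party: cocktail' changes; everything else is untouched.
merged = rented_for.replace({"party: cocktail": "party"})

# map with an incomplete dictionary: unlisted values become NaN.
mapped = rented_for.map({"party: cocktail": "party"})

print(merged.tolist())
print(int(mapped.isna().sum()))
```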

#### 7. The column 'height' is given in feet and inches with quotation marks. Convert it to inches with float datatype.



In [16]:
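# Converting column 'height' from feet-and-inches strings (e.g. 5' 8") to inches (float data type)

def get_inches(x):
    # Missing values arrive as float NaN; keep them missing
    if not isinstance(x, str):
        return None

    feet = int(x[0]) * 12
    try:
        # The characters between "<feet>' " and the closing quote hold the inches
        return feet + int(x[3:-1])
    except ValueError:
        # No parsable inches part: fall back to whole feet
        return feet

df['height'] = df['height'].apply(get_inches).astype(float)
df.head(5)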

Unnamed: 0,fit,bust_size,weight,rating,rented_for,body_type,category,height,size,age
0,fit,34d,137.0,10.0,vacation,hourglass,romper,68.0,14,28.0
1,fit,34b,132.0,10.0,other,straight & narrow,gown,66.0,12,36.0
2,fit,,,10.0,party,,sheath,64.0,4,116.0
3,fit,34c,135.0,8.0,formal affair,pear,dress,65.0,8,34.0
4,fit,34b,145.0,10.0,wedding,athletic,gown,69.0,12,27.0
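
Parsing by character position assumes a single-digit feet value and a fixed layout. A regular expression is one possible alternative; the sketch below is an illustration only, and the helper name `height_to_inches` and the accepted formats (such as a bare `6'`) are assumptions rather than something confirmed by the dataset.

```python
import math
import re

# <feet>' optionally followed by <inches>", with flexible whitespace.
HEIGHT_RE = re.compile(r"""^\s*(\d+)'\s*(?:(\d+)")?\s*$""")

def height_to_inches(value):
    """Convert a string such as 5' 8" to inches; return NaN if it cannot be parsed."""
    if not isinstance(value, str):
        return math.nan  # missing values (float NaN, None) stay missing
    match = HEIGHT_RE.match(value)
    if match is None:
        return math.nan  # unexpected format
    feet = int(match.group(1))
    inches = int(match.group(2) or 0)  # inches part is optional
    return float(feet * 12 + inches)

print(height_to_inches("5' 8\""))
```

On the dataframe this could be applied as `df['height'].apply(height_to_inches)`. Unparsable strings then surface as missing values instead of raising an error.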


#### 8. Check for missing values in each column of the dataset. If any exist, impute them with appropriate methods.



In [17]:
# Looking at the percentage of missing values per column
# (relative to the current number of rows, rounded to two decimals):

pd.DataFrame({'total_missing': df.isnull().sum(), 'perc_missing': (df.isnull().mean()*100).round(2)})

Unnamed: 0,total_missing,perc_missing
fit,0,0.0
bust_size,18373,9.56
weight,29928,15.57
rating,80,0.04
rented_for,10,0.01
body_type,14613,7.6
category,0,0.0
height,673,0.35
size,0,0.0
age,960,0.5


All columns except 'size', 'category', and 'fit' have missing values.



In [18]:
## Using median imputation for numerical columns
## (assigning the result back avoids chained in-place modification of a column)

for col in ['weight','rating','height','age']:
    df[col] = df[col].fillna(df[col].median())

In [19]:
## Using mode imputation for categorical columns
## ('category' has no missing values, so it is left unchanged)

for col in ['bust_size','rented_for','body_type','category']:
    df[col] = df[col].fillna(df[col].mode()[0])

In [20]:
# Recheck missing values after imputation:

pd.DataFrame({'total_missing': df.isnull().sum(), 'perc_missing': df.isnull().mean()*100})

Unnamed: 0,total_missing,perc_missing
fit,0,0.0
bust_size,0,0.0
weight,0,0.0
rating,0,0.0
rented_for,0,0.0
body_type,0,0.0
category,0,0.0
height,0,0.0
size,0,0.0
age,0,0.0


No more missing values left in the dataset.
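
The imputation steps can be sketched on a hypothetical toy frame: the share of missing values per column is the mean of the null mask, numerical gaps are filled with the column median, and categorical gaps with the most frequent value.

```python
import pandas as pd

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "weight": [137.0, None, 145.0, 132.0],
    "body_type": ["pear", "hourglass", None, "hourglass"],
})

# Mean of the boolean null mask = fraction missing; times 100 = percentage.
missing_pct = toy.isnull().mean() * 100

# Median for the numerical column, mode for the categorical one.
toy["weight"] = toy["weight"].fillna(toy["weight"].median())
toy["body_type"] = toy["body_type"].fillna(toy["body_type"].mode()[0])

print(missing_pct.to_dict())
print(toy)
```

Median and mode imputation are simple defaults. For columns with a large share of gaps, they concentrate many rows on a single value, which is worth keeping in mind when interpreting clusters later.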


#### 9. Check the statistical summary for the numerical and categorical columns and write your findings.



In [21]:
# Statistical description of numerical columns

df.describe()

Unnamed: 0,weight,rating,height,size,age
count,192166.0,192166.0,192166.0,192166.0,192166.0
mean,137.020467,9.092659,65.309139,12.246428,33.859575
std,20.145691,1.429982,2.659036,8.497723,8.039723
min,50.0,2.0,54.0,0.0,0.0
25%,125.0,8.0,63.0,8.0,29.0
50%,135.0,10.0,65.0,12.0,32.0
75%,145.0,10.0,67.0,16.0,37.0
max,300.0,10.0,78.0,58.0,117.0


In [22]:
# Statistical description of categorical variables

df.describe(include='O')

Unnamed: 0,fit,bust_size,rented_for,body_type,category
count,192166,192166,192166,192166,192166
unique,3,106,8,7,68
top,fit,34b,wedding,hourglass,dress
freq,141760,45598,57700,69844,92620


- Customer weight ranges from 50 to 300 lbs, with an average of around 137 lbs.
- The average rating is around 9.1, with observed values between 2 and 10.
- Customer height ranges from 54 to 78 in, with an average of around 65 in.
- The maximum size is 58, with an average of around 12.
- The average customer age is around 34 years.
- The minimum age is 0 and the maximum is 117, both implausible: ages of 0 should be imputed with an appropriate value, and the maximum age should be capped at an upper limit.
- The most common rental purpose is 'wedding' (57700 records), and the most frequent product category is 'dress' (92620 records).
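
One way to handle the implausible ages noted above is sketched below on a hypothetical toy Series. The bounds `MIN_AGE` and `MAX_AGE` are illustrative assumptions, not values taken from the dataset or the project brief: ages below the lower bound are treated as missing and filled with the median of the remaining values, and ages above the upper bound are capped.

```python
import pandas as pd

# Hypothetical plausibility bounds; adjust to the business context.
MIN_AGE, MAX_AGE = 18, 80

# Hypothetical sample containing one zero and one extreme value.
age = pd.Series([0.0, 27.0, 32.0, 36.0, 116.0])

# Values below MIN_AGE become NaN (where keeps entries satisfying the condition).
cleaned = age.where(age >= MIN_AGE)

# Fill the gaps with the median of the plausible ages, then cap the upper end.
cleaned = cleaned.fillna(cleaned.median())
cleaned = cleaned.clip(upper=MAX_AGE)

print(cleaned.tolist())
```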



In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 192166 entries, 0 to 192543
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   fit         192166 non-null  object 
 1   bust_size   192166 non-null  object 
 2   weight      192166 non-null  float64
 3   rating      192166 non-null  float64
 4   rented_for  192166 non-null  object 
 5   body_type   192166 non-null  object 
 6   category    192166 non-null  object 
 7   height      192166 non-null  float64
 8   size        192166 non-null  int64  
 9   age         192166 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 16.1+ MB


The 'height' and 'weight' columns have been converted to float data type.


Missing Data for each column:
