**RentTheRunWay**

**● Import the required libraries and load the data**

1. Load the required libraries and read the dataset.


In [None]:
# importing libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

In [None]:
# If using Google Colab

from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/renttherunway.csv')

**2. Check the first few samples, shape, info of the data and try to familiarize yourself with different features**

In [None]:
df.head(2)

In [None]:
df.shape

In [None]:
df.info()

- There are 190K instances and 15 columns.
- We can observe the missing values in the dataset.
- There are around 10 object type variables and 5 numerical variables.

**● Data cleansing and Exploratory data analysis**

3. Check if there are any duplicate records in the dataset? If any, drop them.


In [None]:
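## checking the presence of duplicate records
print(df.duplicated().sum())

## dropping the duplicate records, if any, and resetting the index
df = df.drop_duplicates().reset_index(drop=True)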

4. Drop the columns that you think are redundant for the analysis. (Hint: drop columns like 'id', 'review')


In [None]:
## dropping the redundant columns from the dataset.
df.drop(['user_id', 'item_id', 'review_text', 'review_summary', 'review_date'],axis=1,inplace=True)

In [None]:
df.head(2)

5. Check the column 'weight'. Does it contain string data? If yes, remove the string part and convert the column to float. (Hint: 'weight' has the suffix 'lbs')


In [None]:
df['weight'] = df['weight'].str.replace('lbs','').astype(float)

In [None]:
df['weight'].head()

6. Check the unique categories for the column 'rented for' and group the 'party: cocktail' category with 'party'.


In [None]:
df['rented for'].unique()

In [None]:
## grouping 'party: cocktail' category with the 'party'.
df['rented for'] = df['rented for'].str.replace('party: cocktail','party')

In [None]:
## recheck unique values after grouping
df['rented for'].unique()

7. The column 'height' is in feet and inches with quotation marks. Convert it to inches with float datatype.


In [None]:
## Removing quotation marks
df['height'] = df['height'].str.replace("'",'')
df['height'] = df['height'].str.replace('"','')

In [None]:
## Convert the feet to inches and convert the datatype to float
df['height'] = (df['height'].str[:1].astype(float)*12 + df['height'].str[1:].astype(float))

In [None]:
df['height'].head()
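The slicing above relies on the feet value being a single digit. As a hedged alternative, the sketch below parses feet and inches with a regular expression on a few made-up values; the toy series and the assumed `5' 8"` format are illustrations, not rows from the dataset.

```python
import pandas as pd

# Toy values in the assumed feet-and-inches format of the 'height' column.
heights = pd.Series(["5' 8\"", "5' 10\"", "6' 0\"", None])

# Capture feet and inches as named groups; rows that do not match become NaN.
parts = heights.str.extract(r"(?P<feet>\d+)'\s*(?P<inches>\d+)\"").astype(float)

# Convert to inches.
height_in = parts["feet"] * 12 + parts["inches"]
print(height_in.tolist())
```

Missing or malformed entries stay as NaN, so they can be handled in the missing-value step that follows.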

8. Check for missing values in each column of the dataset. If there are any, impute them with appropriate methods.


In [None]:
df.isnull().sum()/len(df)*100

In [None]:
## Let's treat the categorical columns with the mode imputation technique.
## (assigning the result back avoids relying on in-place fillna on a column selection)
for col in ['bust size','rented for','body type','category']:
    df[col] = df[col].fillna(df[col].mode()[0])

In [None]:
## lets recheck the missing values
df.isnull().sum()
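Only the categorical columns are imputed above; rows with missing numerical values are dropped later in the notebook. One possible alternative, shown here purely as an assumption and not as the approach this notebook takes, is median imputation of the numerical columns. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numerical columns (values are invented).
toy = pd.DataFrame({
    "weight": [120.0, np.nan, 150.0, 135.0],
    "age": [25.0, 31.0, np.nan, 40.0],
})

# Fill each numerical column with its own median.
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
print(toy)
```

The median is less sensitive than the mean to extreme values such as the implausible ages seen in this dataset.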

9. Check the statistical summary for the numerical and categorical columns and write your findings.

In [None]:
## let us check the statistical summary for the numerical columns
df.describe().T

In [None]:
## let us check the statistical summary for the categorical columns.
df.describe(include='O').T

- The average weight of the customers is around 137 lbs.
- The average rating is around 9.
- The maximum height of the customers is 78 inches.
- The maximum standardized size of the product is 58.
- The age range is 0 to 117.
- The minimum age of 0 is implausible and needs to be replaced with an appropriate value, and the maximum age needs to be capped at an upper limit.
- Most of the customers rented the product for a wedding, and the most frequent product category is dress.

10. Are there outliers present in the column age? If yes, treat them with the appropriate method.



In [None]:
sns.boxplot(x=df['age'])
plt.show()

In [None]:
## let's treat the outliers in the column age using the capping technique:
## ages below 20 are raised to 20 and ages above 100 are lowered to 100

df['age'] = df['age'].clip(lower=20, upper=100)

In [None]:
sns.boxplot(x=df['age'])
plt.show()
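The caps of 20 and 100 used above are fixed by hand. As a hedged sketch of a data-driven alternative, the limits could be derived from the interquartile range (IQR); the toy ages below are invented, and the 1.5 multiplier is the usual boxplot convention rather than something taken from this notebook.

```python
import pandas as pd

# Toy ages with one implausibly low and one implausibly high value.
ages = pd.Series([0, 22, 25, 28, 30, 33, 35, 38, 41, 117], dtype=float)

# Boxplot-style fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the fences instead of dropping the rows.
capped = ages.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```

Whether IQR fences or domain-based limits are more appropriate depends on the column; for age, a lower fence below any realistic customer age may still need a manual floor.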

In [None]:
## some rows may still contain missing values (for example in the numerical columns, which were not imputed above), so drop them
df.dropna(inplace=True)

11. Check the distribution of the different categories in the column 'rented for' using appropriate plot.


In [None]:
## let us check the distribution of the column rented for
sns.countplot(x=df['rented for'])
plt.xticks(rotation=45)
plt.show()

- We can see that most of the customers rented the product for a wedding, followed by party and formal affair.

In [None]:
## Let us make a copy of the cleaned dataset before encoding and standardizing the columns
dfc1 = df.copy()

**● Data Preparation for model building:**

12. Encode the categorical variables in the dataset


In [None]:
## Encoding categorical variables using label encoder

## select object datatype variables
object_type_variables = [i for i in df.columns if df.dtypes[i] == object]
object_type_variables


le = LabelEncoder()

def encoder(df):
    for i in object_type_variables:
        q = le.fit_transform(df[i].astype(str))
        df[i] = q
        df[i] = df[i].astype(int)
encoder(df)

In [None]:
df.head()

### 13. Standardize the data, so that the values are within a particular range.

In [None]:
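## Transforming the data with min-max scaling so that every feature lies in the range [0, 1].

mm = MinMaxScaler()

## building a new dataframe keeps the column names and index and avoids assigning floats into integer columns
df = pd.DataFrame(mm.fit_transform(df), columns=df.columns, index=df.index)
df.head()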

In [None]:
## Let us make a copy of the cleaned dataset after encoding and standardizing the columns.
dfc2 = df.copy()

### 14. Apply PCA on the above dataset and determine the number of PCA components to be used so that 90-95% of the variance in data is explained by the same.

In [None]:
## step1: Calculate the covariance matrix.
cov_matrix = np.cov(df.T)
cov_matrix

In [None]:
## step2: Calculate the eigenvalues and eigenvectors.
eig_vals, eig_vectors = np.linalg.eig(cov_matrix)
print('eigenvalues:','\n',eig_vals)
print('\n')
print('eigenvectors:','\n',eig_vectors)

In [None]:
## step3: Calculate the explained variance and the cumulative explained variance (in %).
total = sum(eig_vals)
var_exp = [(i/total)*100 for i in sorted(eig_vals,reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print('Explained Variance: ',var_exp)
print('Cumulative Variance Explained: ',cum_var_exp)

In [None]:
## Scree plot.
plt.bar(range(len(var_exp)),var_exp,align='center',color='lightgreen',edgecolor='black',label='Explained Variance')
plt.step(range(len(var_exp)),cum_var_exp,where='mid',color='red',label='Cumulative Explained Variance')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance (%)')
plt.title('Scree Plot')
plt.legend(loc='best')
plt.show()

- The scree plot above shows that the first 6 principal components explain about 90-95% of the variance, so we choose 6 as the number of principal components.
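Reading the component count off a plot can also be done programmatically. The sketch below, run on synthetic data rather than the notebook's dataframe, picks the smallest number of components whose cumulative explained variance exceeds a threshold and compares it with scikit-learn's own selection when `n_components` is given as a fraction. The array `X` and the 0.90 threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix: 10 columns with unequal variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * np.linspace(3.0, 0.3, 10)

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X)
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the threshold.
threshold = 0.90
n_components = int(np.searchsorted(cum_ratio, threshold, side="right") + 1)

# scikit-learn makes the same choice when n_components is a fraction in (0, 1).
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print(n_components, pca_auto.n_components_)
```

Passing the fraction directly is convenient when the threshold, not the component count, is the quantity being specified.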

### 15. Apply K-means clustering and segment the data (You may use original data or PCA transformed data)
- a. Find the optimal K value using the elbow plot for K-means clustering.
- b. Build a K-means clustering model using the optimal K value obtained from the elbow plot.
- c. Compute the silhouette score for evaluating the quality of the K-means clustering technique.

In [None]:
## Using the number of dimensions obtained from the PCA (i.e., 6) to apply clustering.
pca = PCA(n_components=6)

pca_df = pd.DataFrame(pca.fit_transform(df),columns=['PC1','PC2','PC3','PC4','PC5','PC6'])
pca_df.head()

- These are the new dimensions obtained from the application of PCA.

#### K-means clustering (using the PCA-transformed data)

In [None]:
## finding optimal K value by KMeans clustering using Elbow plot.
cluster_errors = []
cluster_range = range(2,15)
for num_clusters in cluster_range:
    clusters = KMeans(num_clusters,random_state=100)
    clusters.fit(pca_df)
    cluster_errors.append(clusters.inertia_)

In [None]:
## creating a dataframe of number of clusters and cluster errors.
cluster_df = pd.DataFrame({'num_clusters':cluster_range,'cluster_errors':cluster_errors})

In [None]:
## Elbow plot.
plt.figure(figsize=[15,5])
plt.plot(cluster_df['num_clusters'],cluster_df['cluster_errors'],marker='o',color='b')
plt.show()

- In the elbow plot above, the inertia drops sharply up to K=3 and decreases much more slowly afterwards. Hence we select K=3 as the optimal number of clusters.

In [None]:
## Applying KMeans clustering for the optimal number of clusters obtained above.
kmeans = KMeans(n_clusters=3, random_state=100)
kmeans.fit(pca_df)

In [None]:
## creating a dataframe of the labels.
label = pd.DataFrame(kmeans.labels_,columns=['Label'])

In [None]:
## joining the label dataframe to the pca_df dataframe.
kmeans_df = pca_df.join(label)
kmeans_df.head()

In [None]:
kmeans_df['Label'].value_counts()

In [None]:
## finding optimal clusters through silhouette score
## (a separate variable name keeps the K=3 model fitted above intact)
for i in range(2,15):
    km = KMeans(i,random_state=100)
    labels = km.fit_predict(pca_df)
    print(i,silhouette_score(pca_df,labels))

- From the elbow plot above we chose K=3 as the optimal value and built a K-means clustering model with it.
- The silhouette scores are highest for 2 and 3 clusters, so a K-means clustering model can be built with either K=2 or K=3.
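The silhouette score compares every point with every other point, so on a dataset of this size it can be slow. As a hedged sketch, `silhouette_score` accepts `sample_size` and `random_state` arguments that score a random subset instead; the synthetic blobs, the explicit centers and the subset size of 1000 below are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the notebook's data).
X, _ = make_blobs(
    n_samples=3000,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.8,
    random_state=100,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=100).fit_predict(X)
    # Score a random subset of 1000 points instead of all pairwise distances.
    scores[k] = silhouette_score(X, labels, sample_size=1000, random_state=100)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Because the score is computed on a subset, it is an estimate; fixing `random_state` keeps the comparison across K values consistent.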

### 16. Apply Agglomerative clustering and segment the data. (You may use original data or PCA transformed data)
- a. Find the optimal K value using the dendrogram for Agglomerative clustering.
- b. Build an Agglomerative clustering model using the optimal K value obtained from the dendrogram.
- c. Compute the silhouette score for evaluating the quality of the Agglomerative clustering technique.

(Hint: Take a sample of the dataset for agglomerative clustering)

#### Agglomerative clustering (using original data)

In [None]:
## Let us use the dfc2 for this (a copy of the cleaned dataset after encoding and data standardization)

In [None]:
## Since the dataset is large (around 200,000 rows), plotting a dendrogram on all of it would be time consuming.
## Let us take a sample of the dataset instead.

In [None]:
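## Taking a sample of 50K rows from the dfc2 dataframe using random sampling technique provided by pandas
## (random_state is fixed so that the sample is reproducible)

## Storing it in the new dataframe called 'dfc3'
dfc3 = dfc2.sample(n=50000, random_state=100)

## keeping only the matching rows of the unscaled dataframe dfc1, in the same order,
## so that the cluster labels can later be joined to it row by row
dfc1 = dfc1.loc[dfc3.index].reset_index(drop=True)

## resetting the index
dfc3.reset_index(inplace=True,drop=True)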

In [None]:
dfc3.head(4)

**Dendrogram**

In [None]:
plt.figure(figsize=[18,5])
merg = linkage(dfc3, method='ward')
dendrogram(merg, leaf_rotation=90,)
plt.xlabel('Datapoints')
plt.ylabel('Euclidean distance')
plt.show()

- To choose the number of clusters, we look for the longest vertical distance in the dendrogram that is not crossed by any horizontal merge line, and cut the tree there.
- From the dendrogram above, the optimal number of clusters is 2.
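The cut can also be applied directly to the linkage matrix with SciPy's `fcluster`, which is imported at the top of this notebook. The sketch below uses a small synthetic array with two obvious groups, an assumption for illustration, and asks for at most two flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two tight, well-separated groups of 20 points each (synthetic data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Ward linkage, as used for the dendrogram.
Z = linkage(X, method="ward")

# Cut the tree so that at most 2 flat clusters remain; labels start at 1.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

The third column of the linkage matrix, `Z[:, 2]`, holds the merge distances that the dendrogram draws as vertical heights.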

In [None]:
## Building hierarchical clustering model using the optimal clusters as 2
## (ward linkage always uses Euclidean distance, so no distance argument is needed)
hie_cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
hie_cluster_model = hie_cluster.fit(dfc3)

In [None]:
## Creating a dataframe of the labels
df_label1 = pd.DataFrame(hie_cluster_model.labels_,columns=['Labels'])
df_label1.head(5)

In [None]:
## joining the label dataframe with unscaled initial cleaned dataframe.(dfc1)

df_hier = dfc1.join(df_label1)
df_hier.head()

In [None]:
## silhouette scores for different numbers of clusters
for i in range(2,15):
    hier = AgglomerativeClustering(n_clusters=i)
    labels = hier.fit_predict(dfc3)
    print(i,silhouette_score(dfc3,labels))

- The silhouette scores for agglomerative clustering are highest for 2 clusters.

### 17. Conclusion.

Perform cluster analysis by doing bivariate analysis between cluster label and different features and write your
conclusion on the results.

In [None]:
df_hier.head(2)

In [None]:
df_hier['Labels'].value_counts().plot(kind='pie',autopct='%0.1f')
plt.show()

- The clusters formed are imbalanced: more records are assigned to cluster 0 than to cluster 1.

In [None]:
## Let us check the distribution of the different categories of 'rented for' column
## w.r.t the clusters formed by agglomerative clustering technique.
sns.countplot(x='rented for',hue='Labels',data=df_hier)
plt.xticks(rotation = 45)
plt.show()

- Most users rented the product for a wedding, and in general more users belong to cluster 0 than to cluster 1.

In [None]:
## Let's check the age distribution of the different clusters.
sns.kdeplot(data=df_hier,x='age',hue='Labels')
plt.show()

- The age distributions of the two clusters have almost the same shape; the curve for cluster 0 is higher because more observations are assigned to it.
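Alongside the plots, the clusters can be compared numerically. The following is a minimal sketch on an invented toy frame that mimics the columns of `df_hier`: it computes per-cluster means of numerical features and the per-cluster share of each 'rented for' category. The values are assumptions and say nothing about the real clusters.

```python
import pandas as pd

# Toy frame with the same kind of columns as df_hier (values are invented).
toy = pd.DataFrame({
    "Labels": [0, 0, 0, 1, 1],
    "age": [25, 31, 40, 45, 51],
    "weight": [120.0, 135.0, 150.0, 160.0, 170.0],
    "rented for": ["wedding", "party", "wedding", "work", "wedding"],
})

# Mean of each numerical feature per cluster.
profile = toy.groupby("Labels")[["age", "weight"]].mean()

# Share of each 'rented for' category within each cluster (rows sum to 1).
share = pd.crosstab(toy["Labels"], toy["rented for"], normalize="index")

print(profile)
print(share)
```

Row-normalized shares make imbalanced clusters comparable, since raw counts are dominated by the larger cluster.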

- In this project, we applied PCA to the renttherunway dataset and selected 6 principal components, which explain about 90-95% of the variance in the data.
- We then used the PCA dimensions to segment similar records into clusters with K-means clustering.
- For K-means, we first chose the optimal K value with the help of the elbow plot and then used that value to build the K-means clustering model.
- We computed the silhouette score for different K values to evaluate the quality of the clustering.
- We took a sample of the data, applied agglomerative clustering to the original (encoded and scaled) features, plotted a dendrogram to determine the optimal number of clusters, built an agglomerative clustering model with that value, and evaluated it using the silhouette score.
- This dataset has a small number of features. The company could collect further information: demographic (income, education), geographic (customer location, such as rural or urban, and city), behavioral (browsing, amount spent by category, sentiment towards specific products and price points), and lifestyle survey data (hobbies, fashion preferences).
- With more features, customer segmentation would be more effective: we could infer more from the clusters formed and give the company suggestions that help it target the right customers, stay in the market longer and increase revenue.