

# Predicting Customer Satisfaction on Rent the Runway

##  IV. Feature Importance Selection

### Katrin Ayrapetov



### Overview of the Notebook: 

In this notebook, a chi-squared test and AdaBoost are each used to select the ten most important features. The two lists are then merged, with duplicates removed, into a final set of twelve features.

**According to the chi-squared test, the ten selected features are (listed in column order, not ranked by score):**  


<br>&emsp;&emsp;Feature 1 is Type_of_Customer
<br>&emsp;&emsp;Feature 2 is Size
<br>&emsp;&emsp;Feature 3 is Overall_fit
<br>&emsp;&emsp;Feature 4 is Age
<br>&emsp;&emsp;Feature 5 is Weight
<br>&emsp;&emsp;Feature 6 is Date
<br>&emsp;&emsp;Feature 7 is Brand
<br>&emsp;&emsp;Feature 8 is Number_of_reviews
<br>&emsp;&emsp;Feature 9 is BMI
<br>&emsp;&emsp;Feature 10 is Neckline


**According to AdaBoost, the most important features, in order of selection, are:**  

<br>&emsp;&emsp;Feature 1 is Size
<br>&emsp;&emsp;Feature 2 is Overall_fit
<br>&emsp;&emsp;Feature 3 is Number_of_reviews
<br>&emsp;&emsp;Feature 4 is Type_of_Customer
<br>&emsp;&emsp;Feature 5 is Rented_for
<br>&emsp;&emsp;Feature 6 is Age
<br>&emsp;&emsp;Feature 7 is Dress_Style
<br>&emsp;&emsp;Feature 8 is Neckline
<br>&emsp;&emsp;Feature 9 is Weight
<br>&emsp;&emsp;Feature 10 is BMI

**The union of the two lists gives the following features (in no particular order):**  

<br>&emsp;&emsp;Feature 1 is Size
<br>&emsp;&emsp;Feature 2 is Neckline
<br>&emsp;&emsp;Feature 3 is BMI
<br>&emsp;&emsp;Feature 4 is Dress_Style
<br>&emsp;&emsp;Feature 5 is Type_of_Customer
<br>&emsp;&emsp;Feature 6 is Weight
<br>&emsp;&emsp;Feature 7 is Date
<br>&emsp;&emsp;Feature 8 is Number_of_reviews
<br>&emsp;&emsp;Feature 9 is Rented_for
<br>&emsp;&emsp;Feature 10 is Brand
<br>&emsp;&emsp;Feature 11 is Age
<br>&emsp;&emsp;Feature 12 is Overall_fit

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
import warnings
warnings.filterwarnings("ignore")

In [2]:
import itertools
from sklearn import model_selection
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

In [3]:
df =  pd.read_csv('../Data/df_clean.csv')

In [4]:
df.drop(columns=["Retail_price"],inplace=True)

In [5]:
# Use label encoder to encode the categorical features. 
lencoders = {}
for col in df.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    df[col] = lencoders[col].fit_transform(df[col])
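
To make the encoding step concrete, here is a minimal sketch on a made-up column. The `body_types` values are illustrative assumptions, not taken from `df_clean.csv`. `LabelEncoder` assigns integer codes following the sorted order of the class labels, so the codes imply an ordering that a nominal feature does not really have. Keeping the fitted encoder, as the `lencoders` dictionary does, makes it possible to decode the integers later.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values, for illustration only.
body_types = ["petite", "athletic", "hourglass", "athletic"]

encoder = LabelEncoder()
codes = encoder.fit_transform(body_types)

# Classes are stored in sorted order; each code is an index into classes_.
print(list(encoder.classes_))
print(list(codes))

# The fitted encoder can map the integer codes back to the original labels.
print(list(encoder.inverse_transform(codes)))
```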

In [6]:
# Use MinMaxScaler to scale every column to the [0, 1] range.
r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(df)
modified_data = pd.DataFrame(r_scaler.transform(df), columns=df.columns)


### Select the top ten features using the chi-squared test

In [8]:
# Break up the data frame into the features and the target variable
X = modified_data.loc[:,modified_data.columns!='Rating']
y = modified_data[['Rating']]


In [9]:
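# Cast the target to integer class labels for chi2.
# Caution: Rating was min-max scaled to [0, 1] above, so this cast truncates every value below 1 to 0.
y=y.astype('int')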
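
One thing worth checking at this step: `modified_data` was min-max scaled, so `Rating` lies in [0, 1] before the integer cast, and casting truncates toward zero. The sketch below shows the effect on a small made-up series. The rating values are an assumption for illustration, not read from the dataset. A possible alternative, shown at the end, is to take the target from the unscaled labels instead.

```python
import pandas as pd

# Hypothetical ratings on an even 2-10 scale (an assumption for illustration).
ratings = pd.Series([2, 4, 6, 8, 10], name="Rating")

# Min-max scaling maps the column onto [0, 1].
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())

# Casting to int truncates toward zero: only the maximum survives as 1.
truncated = scaled.astype("int")
print(truncated.tolist())

# One alternative: keep the original integer labels as the target.
labels = ratings.astype("int")
print(labels.tolist())
```

If the truncated target is what the chi-squared test sees, the test effectively compares the top rating against everything else, which may or may not be the intended question.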

In [11]:
#Instantiate the SelectKBest class using Chi Squared Test and top 10 features. 
selector = SelectKBest(chi2, k=10)

In [12]:
#Fit the selector to the data. 
selector.fit(X, y)

SelectKBest(score_func=<function chi2 at 0x0000017DCD0951F0>)

In [13]:
#Transform the data using the fitted selector. 
X_new = selector.transform(X)


In [14]:
# Print the selected features (get_support returns them in column order, not ranked by score).
imp_feat_chi_squared = list(X.columns[selector.get_support(indices=True)])
print("Important features using Chi Square")
for i in range(10):
    print(f"Feature {i + 1} is {imp_feat_chi_squared[i]}")

Important features using Chi Square
Feature 1 is Type_of_Customer
Feature 2 is Size
Feature 3 is Overall_fit
Feature 4 is Age
Feature 5 is Weight
Feature 6 is Date
Feature 7 is Brand
Feature 8 is Number_of_reviews
Feature 9 is BMI
Feature 10 is Neckline
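
`get_support` reports which columns were kept, not how they ranked. If a ranking is wanted, the fitted selector also exposes `scores_`, which can be sorted. The sketch below illustrates this on synthetic data; the column names `informative`, `noise_a`, and `noise_b` and the data itself are assumptions made up for the example, not part of this notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 500

# Synthetic binary target and non-negative features (chi2 requires X >= 0).
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "noise_a": rng.random(n),
    "informative": y_demo * 0.8 + rng.random(n) * 0.2,
    "noise_b": rng.random(n),
})

selector_demo = SelectKBest(chi2, k=2).fit(X_demo, y_demo)

# Indices of the kept columns come back in ascending column order.
support_idx = selector_demo.get_support(indices=True)
print(list(X_demo.columns[support_idx]))

# Sorting the per-feature chi2 scores gives an actual ranking.
ranked = pd.Series(selector_demo.scores_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```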


### Select the top ten features using AdaBoost

#### Use AdaBoost with a single stump to select the most important feature, remove it from the feature list, and repeat. Collect the top ten features in a list.

In [15]:
features = ['Type_of_Customer', 'Size', 'Overall_fit', 'Rented_for', 'Size_usually_worn',
            'Height', 'Age', 'Weight', 'Bust_size', 'Body_type', 'Date', 'Rent_price',
            'Number_of_reviews', 'Sleeves', 'Neckline', 'Dress_Style', 'BMI', 'Brand']
target = ['Rating']


In [16]:
imp_feat_ADA_boost = []

In [17]:
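for i in range(10):
    # Refit on the features that have not been selected yet.
    X = df[features]
    y = df[target].values.ravel()  # 1-D target, as scikit-learn expects
    # A single boosting round fits one decision stump (the default base estimator).
    model = AdaBoostClassifier(n_estimators=1)
    model.fit(X, y)
    # The stump splits on one feature, so argmax picks exactly that feature.
    important_feature = X.columns[model.feature_importances_.argmax()]
    print(f"Feature {i + 1} is {important_feature}")
    # Remove the winner so the next round must pick a different feature.
    features.remove(important_feature)
    imp_feat_ADA_boost.append(important_feature)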

Feature 1 is Size
Feature 2 is Overall_fit
Feature 3 is Number_of_reviews
Feature 4 is Type_of_Customer
Feature 5 is Rented_for
Feature 6 is Age
Feature 7 is Dress_Style
Feature 8 is Neckline
Feature 9 is Weight
Feature 10 is BMI
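
The select-remove-repeat procedure can also be wrapped in a function that leaves the caller's feature list untouched. The following is a minimal sketch under stated assumptions: the helper name `rank_features_by_stump`, the fixed `random_state`, and the synthetic columns `strong`, `weak`, and `noise` are all hypothetical and introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier


def rank_features_by_stump(X, y, n_features):
    """Greedily pick features: fit one stump, keep its split feature, drop it, repeat."""
    remaining = list(X.columns)  # copy, so the caller's column list is not mutated
    ranked = []
    for _ in range(min(n_features, len(remaining))):
        # n_estimators=1 means a single depth-1 tree (the default base estimator).
        model = AdaBoostClassifier(n_estimators=1, random_state=0)
        model.fit(X[remaining], y)
        best = remaining[int(np.argmax(model.feature_importances_))]
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Synthetic data: two features carry signal of different strength, one is noise.
rng = np.random.default_rng(0)
n = 600
y_demo = rng.integers(0, 2, size=n)
X_demo = pd.DataFrame({
    "noise": rng.normal(0.0, 1.0, n),
    "weak": y_demo + rng.normal(0.0, 1.0, n),
    "strong": y_demo + rng.normal(0.0, 0.3, n),
})

top_two = rank_features_by_stump(X_demo, y_demo, n_features=2)
print(top_two)
```

Because each round looks at a single stump, this ranking reflects how well each feature separates the classes on its own with one threshold. It does not capture interactions between features.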


In [18]:
# Concatenate the two lists; duplicates are removed in the next cell.
all_important_features = imp_feat_chi_squared + imp_feat_ADA_boost

In [19]:
# Remove duplicates. Note that set() does not preserve the original order.
all_important_features = list(set(all_important_features))

In [20]:
print("Important features:")
for i in range(len(all_important_features)):
    print(f"Feature {i + 1} is {all_important_features[i]}")

Important features:
Feature 1 is Size
Feature 2 is Neckline
Feature 3 is BMI
Feature 4 is Dress_Style
Feature 5 is Type_of_Customer
Feature 6 is Weight
Feature 7 is Date
Feature 8 is Number_of_reviews
Feature 9 is Rented_for
Feature 10 is Brand
Feature 11 is Age
Feature 12 is Overall_fit
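
Since `set()` does not guarantee an order, the numbering printed above can change between runs. If a stable order is preferred, one option is `dict.fromkeys`, which drops duplicates while keeping the first occurrence of each item. The sketch below reuses the two feature lists exactly as printed earlier in this notebook; the variable names `chi_squared_features`, `ada_boost_features`, and `ordered_union` are introduced here for illustration.

```python
# Feature lists copied from the outputs printed earlier in this notebook.
chi_squared_features = [
    "Type_of_Customer", "Size", "Overall_fit", "Age", "Weight",
    "Date", "Brand", "Number_of_reviews", "BMI", "Neckline",
]
ada_boost_features = [
    "Size", "Overall_fit", "Number_of_reviews", "Type_of_Customer", "Rented_for",
    "Age", "Dress_Style", "Neckline", "Weight", "BMI",
]

# dict keys are unique and keep insertion order, so this is an ordered union.
ordered_union = list(dict.fromkeys(chi_squared_features + ada_boost_features))

for i, name in enumerate(ordered_union, start=1):
    print(f"Feature {i} is {name}")
```

With this approach the chi-squared features come first, followed by the two features that only AdaBoost selected.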
