# Customer-to-Customer (C2C) Business Analytics

This project analyzes a fashion customer-to-customer (C2C) e-commerce platform that lets users sell products to other users. A seller's performance depends not only on the platform's user interface but also on factors such as the type of products the seller uploads, the image quality and description of each product, customer service, and the seller's social engagement (this list is not exhaustive).

The following insights will be drawn from the available data:

* Which factors contribute to sellers generating good sales on the platform?
* What is the typical lifetime value of a customer on the platform?
* What is the average retention rate of buyers on the platform?
* Given that the platform is based in France, how likely are users from other countries to sign up?
* How active are users, in general, on the platform?

### Required Steps
* Create dummy or indicator features for categorical variables
* Standardize the magnitude of numeric features using a scaler
* Split your data into testing and training datasets

Load Packages

In [1]:
#load python packages
import os
import pandas as pd
import datetime 
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
%matplotlib inline

currentdirectory = os.getcwd()
print(currentdirectory)

/Users/oluwafemibabatunde


In [2]:
path = '/Users/oluwafemibabatunde/Desktop/Springboard/capstone_three/C2C Business Analytics/data'
os.chdir(path)
dfProcessed = pd.read_csv('step2DF_output.csv')

In [3]:
dfProcessed.head()

Unnamed: 0.1,Unnamed: 0,identifierHash,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,productsWished,productsBought,...,gender_F,gender_M,civilityTitle_miss,civilityTitle_mr,civilityTitle_mrs,lang_de,lang_en,lang_es,lang_fr,lang_it
0,0,-1097895247965112460,147,10,77,26,174,74.0,104,1,...,0,1,0,1,0,0,1,0,0,0
1,1,2347567364561867620,167,8,2,19,170,99.0,0,0,...,1,0,0,0,1,0,1,0,0,0
2,2,6870940546848049750,137,13,60,33,163,94.0,10,3,...,1,0,0,0,1,0,0,0,1,0
3,3,-4640272621319568052,131,10,14,122,152,92.0,7,0,...,1,0,0,0,1,0,1,0,0,0
4,4,-5175830994878542658,167,8,0,25,125,100.0,0,0,...,1,0,0,0,1,0,1,0,0,0


In [4]:
dfProcessed = dfProcessed.drop(['Unnamed: 0'], axis=1)

In [5]:
dfProcessed.shape

(98913, 227)

In [6]:
dfProcessed.dtypes

identifierHash         int64
socialNbFollowers      int64
socialNbFollows        int64
socialProductsLiked    int64
productsListed         int64
                       ...  
lang_de                int64
lang_en                int64
lang_es                int64
lang_fr                int64
lang_it                int64
Length: 227, dtype: object

In [7]:
dfProcessed['seniority']

0        3196
1        3204
2        3203
3        3198
4        2854
         ... 
98908    3204
98909    3204
98910    3204
98911    3204
98912    3204
Name: seniority, Length: 98913, dtype: int64

During the data wrangling stage of this project, the categorical variables were converted to indicator (dummy) columns, because the models used later require numeric inputs.
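
As a rough illustration of that conversion, the sketch below one-hot encodes a small made-up frame with `pd.get_dummies`. The toy values are assumptions; the `gender` and `lang` prefixes only mirror column names such as `gender_F` and `lang_en` seen above, and the wrangling notebook may have used a different method.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the project dataset)
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "lang": ["en", "fr", "en"],
    "productsSold": [3, 0, 1],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["gender", "lang"], dtype=int)
print(encoded.columns.tolist())
```

Each original categorical column is replaced by one 0/1 column per category, which is the shape of the `gender_*`, `civilityTitle_*` and `lang_*` columns in `dfProcessed`.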

Normalizing and standardizing some features of the dataset for modeling

In [8]:
dfProcessed.describe()

Unnamed: 0,identifierHash,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,productsWished,productsBought,civilityGenderId,...,gender_F,gender_M,civilityTitle_miss,civilityTitle_mr,civilityTitle_mrs,lang_de,lang_en,lang_es,lang_fr,lang_it
count,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,...,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0
mean,-6692039000000000.0,3.432269,8.425677,4.420743,0.093304,0.121592,0.812303,1.562595,0.171929,1.773993,...,0.769575,0.230425,0.004418,0.230425,0.765157,0.072569,0.521307,0.060993,0.266618,0.078513
std,5.330807e+18,3.882383,52.839572,181.030569,2.050144,2.126895,8.500205,25.192793,2.332266,0.428679,...,0.421107,0.421107,0.066322,0.421107,0.423903,0.259429,0.499548,0.239319,0.442193,0.268979
min,-9.223101e+18,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-4.622895e+18,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,-1337989000000000.0,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
75%,4.616388e+18,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
max,9.223331e+18,744.0,13764.0,51671.0,244.0,174.0,100.0,2635.0,405.0,3.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [9]:
col_names = ['socialNbFollowers',
       'socialNbFollows', 'socialProductsLiked', 'productsListed',
       'productsSold', 'productsPassRate', 'productsWished', 'productsBought',
       'daysSinceLastLogin',
       'seniority']
# log(1 + x): zero counts map to 0 instead of -inf
dflog = np.log1p(dfProcessed[col_names])
dflog.describe()

Unnamed: 0,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,productsWished,productsBought,daysSinceLastLogin,seniority
count,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0
mean,1.446699,2.205499,0.2919,0.022524,0.027875,0.041911,0.153908,0.058182,6.089242,8.026194
std,0.211269,0.101268,0.838855,0.20061,0.229073,0.429967,0.594886,0.282989,1.125791,0.055625
min,1.386294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.956126
25%,1.386294,2.197225,0.0,0.0,0.0,0.0,0.0,0.0,6.331502,7.957877
50%,1.386294,2.197225,0.0,0.0,0.0,0.0,0.0,0.0,6.527958,8.069968
75%,1.386294,2.197225,0.0,0.0,0.0,0.0,0.0,0.0,6.539586,8.071531
max,6.613384,9.529884,10.852671,5.501258,5.164786,4.615121,7.877018,6.006353,6.549651,8.072779
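
Adding 1 before taking the logarithm matters because most of these counts are zero, and log(0) is undefined. A small sketch with made-up counts shows that `np.log(x + 1)` agrees with NumPy's `np.log1p`, and that `np.expm1` reverses the transform:

```python
import numpy as np

# Made-up counts; 174 is the productsSold maximum reported above
counts = np.array([0, 1, 10, 174])

logged = np.log(counts + 1)  # log(1 + x), as in cell 9
assert np.allclose(logged, np.log1p(counts))

# expm1 inverts log1p, recovering the original counts
recovered = np.expm1(logged)
print(np.round(recovered).astype(int))
```

For the largest value, log(175) is about 5.1648, which matches the `productsSold` maximum in the table above.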


In [10]:
from sklearn.preprocessing import StandardScaler

SS_scaler = StandardScaler()

col_names = ['socialNbFollowers',
       'socialNbFollows', 'socialProductsLiked', 'productsListed',
       'productsSold', 'productsPassRate', 'productsWished', 'productsBought',
       'daysSinceLastLogin',
       'seniority']

scaledCols = ['ScSocialNbFollowers',
       'ScsocialNbFollows', 'ScsocialProductsLiked', 'ScproductsListed',
       'ScproductsSold', 'ScproductsPassRate', 'ScproductsWished', 'ScproductsBought',
       'ScdaysSinceLastLogin',
       'Scseniority'] 

dfScale = SS_scaler.fit_transform(dfProcessed[col_names])

dfScale = pd.DataFrame(dfScale , columns=scaledCols)



dfScale.head()

Unnamed: 0,ScSocialNbFollowers,ScsocialNbFollows,ScsocialProductsLiked,ScproductsListed,ScproductsSold,ScproductsPassRate,ScproductsWished,ScproductsBought,ScdaysSinceLastLogin,Scseniority
0,36.979467,0.029795,0.400925,12.636592,81.752629,8.610153,4.06616,0.355052,-2.730563,0.78568
1,42.130969,-0.008056,-0.013372,9.222179,79.871944,11.551273,-0.062026,-0.073718,-2.725775,0.833214
2,34.403717,0.08657,0.307017,16.051004,76.580745,10.963049,0.334915,1.212591,-2.730563,0.827273
3,32.858266,0.029795,0.052915,59.462818,71.408862,10.72776,0.215833,-0.073718,-2.725775,0.797563
4,42.130969,-0.008056,-0.02442,12.148818,58.714238,11.668918,-0.062026,-0.073718,-2.677895,-1.246433
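
Cell 10 fits the scaler on every row, so the means and variances used for scaling also reflect rows that later land in the test set. One common alternative, sketched here on synthetic data and not part of the original workflow, is to split first and fit `StandardScaler` on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.poisson(lam=3.0, size=(200, 2)).astype(float)  # synthetic stand-in features
y = rng.poisson(lam=1.0, size=200)                      # synthetic stand-in target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_sc = scaler.transform(X_test)        # test rows reuse those statistics

print(X_train_sc.mean(axis=0).round(6), X_train_sc.std(axis=0).round(6))
```

Whether this matters in practice depends on the models used later; it is shown only as an option to consider.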


In [11]:
dfScale.reset_index(drop = True, inplace =True)

In [12]:
# dfScale is already a DataFrame with the scaled column names, so it can be concatenated directly
dfScaled = pd.concat([dfProcessed, dfScale], axis=1)
dfScaled.head()

Unnamed: 0,identifierHash,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,productsWished,productsBought,civilityGenderId,...,ScSocialNbFollowers,ScsocialNbFollows,ScsocialProductsLiked,ScproductsListed,ScproductsSold,ScproductsPassRate,ScproductsWished,ScproductsBought,ScdaysSinceLastLogin,Scseniority
0,-1097895247965112460,147,10,77,26,174,74.0,104,1,1,...,36.979467,0.029795,0.400925,12.636592,81.752629,8.610153,4.06616,0.355052,-2.730563,0.78568
1,2347567364561867620,167,8,2,19,170,99.0,0,0,2,...,42.130969,-0.008056,-0.013372,9.222179,79.871944,11.551273,-0.062026,-0.073718,-2.725775,0.833214
2,6870940546848049750,137,13,60,33,163,94.0,10,3,2,...,34.403717,0.08657,0.307017,16.051004,76.580745,10.963049,0.334915,1.212591,-2.730563,0.827273
3,-4640272621319568052,131,10,14,122,152,92.0,7,0,2,...,32.858266,0.029795,0.052915,59.462818,71.408862,10.72776,0.215833,-0.073718,-2.725775,0.797563
4,-5175830994878542658,167,8,0,25,125,100.0,0,0,2,...,42.130969,-0.008056,-0.02442,12.148818,58.714238,11.668918,-0.062026,-0.073718,-2.677895,-1.246433


In [13]:
# Drop the unscaled originals; col_names holds the same ten columns that were scaled
dfScaled = dfScaled.drop(columns=col_names)
dfScaled.head()

Unnamed: 0,identifierHash,civilityGenderId,hasAnyApp,hasAndroidApp,hasIosApp,hasProfilePicture,countryCode,country_Afghanistan,country_Afrique du Sud,country_Albanie,...,ScSocialNbFollowers,ScsocialNbFollows,ScsocialProductsLiked,ScproductsListed,ScproductsSold,ScproductsPassRate,ScproductsWished,ScproductsBought,ScdaysSinceLastLogin,Scseniority
0,-1097895247965112460,1,1,0,1,1,gb,0,0,0,...,36.979467,0.029795,0.400925,12.636592,81.752629,8.610153,4.06616,0.355052,-2.730563,0.78568
1,2347567364561867620,2,1,0,1,1,mc,0,0,0,...,42.130969,-0.008056,-0.013372,9.222179,79.871944,11.551273,-0.062026,-0.073718,-2.725775,0.833214
2,6870940546848049750,2,1,0,1,0,fr,0,0,0,...,34.403717,0.08657,0.307017,16.051004,76.580745,10.963049,0.334915,1.212591,-2.730563,0.827273
3,-4640272621319568052,2,1,0,1,0,us,0,0,0,...,32.858266,0.029795,0.052915,59.462818,71.408862,10.72776,0.215833,-0.073718,-2.725775,0.797563
4,-5175830994878542658,2,0,0,0,1,us,0,0,0,...,42.130969,-0.008056,-0.02442,12.148818,58.714238,11.668918,-0.062026,-0.073718,-2.677895,-1.246433


In [14]:
dfScaled.describe()

Unnamed: 0,identifierHash,civilityGenderId,hasAnyApp,hasAndroidApp,hasIosApp,hasProfilePicture,country_Afghanistan,country_Afrique du Sud,country_Albanie,country_Algérie,...,ScSocialNbFollowers,ScsocialNbFollows,ScsocialProductsLiked,ScproductsListed,ScproductsSold,ScproductsPassRate,ScproductsWished,ScproductsBought,ScdaysSinceLastLogin,Scseniority
count,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,...,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0
mean,-6692039000000000.0,1.773993,0.264616,0.04872,0.217636,0.980842,0.000101,0.000839,0.000374,0.000768,...,-5.833643e-15,1.614438e-14,1.433288e-15,-1.883057e-14,-9.818334e-14,-1.039872e-14,5.542929e-14,-5.976973e-14,-1.848443e-14,-2.188985e-13
std,5.330807e+18,0.428679,0.441131,0.215282,0.41264,0.137082,0.010054,0.028956,0.019337,0.027709,...,1.000005,1.000005,1.000005,1.000005,1.000005,1.000005,1.000005,1.000005,1.000005,1.000005
min,-9.223101e+18,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.1113417,-0.1594585,-0.02442,-0.0455113,-0.05716892,-0.0955632,-0.06202581,-0.0737179,-2.730563,-1.258317
25%,-4.622895e+18,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.1113417,-0.008056069,-0.02442,-0.0455113,-0.05716892,-0.0955632,-0.06202581,-0.0737179,-0.04448657,-1.228607
50%,-1337989000000000.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.1113417,-0.008056069,-0.02442,-0.0455113,-0.05716892,-0.0955632,-0.06202581,-0.0737179,0.5396512,0.7856796
75%,4.616388e+18,2.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.1113417,-0.008056069,-0.02442,-0.0455113,-0.05716892,-0.0955632,-0.06202581,-0.0737179,0.5779553,0.8153889
max,9.223331e+18,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,190.7518,260.3285,285.404,118.9711,81.75263,11.66892,104.5319,173.578,0.6114714,0.8391563
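
The standard deviations of the scaled columns read 1.000005 instead of exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (ddof=0), while `DataFrame.describe()` reports the sample standard deviation (ddof=1). With n = 98913 rows the ratio between the two is sqrt(n / (n - 1)), which the following check reproduces:

```python
import math

n = 98913  # number of rows, from dfProcessed.shape
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 6))  # → 1.000005
```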


Splitting dataframe into dependent and independent variables

Seller dataframe splitting

In [15]:
# Features: everything except the identifiers and the target
Xseller = dfScaled.drop(['identifierHash', 'countryCode', 'ScproductsSold'], axis=1)

# Target: standardized number of products sold
yseller = dfScaled['ScproductsSold']

Xseller_train, Xseller_test, yseller_train, yseller_test = train_test_split(Xseller, yseller, test_size=0.25, random_state=42)

# Save each split to the data directory (`path` is defined in cell 2)
Xseller_train.to_csv(os.path.join(path, 'Xseller_train.csv'))
Xseller_test.to_csv(os.path.join(path, 'Xseller_test.csv'))

# Convert the target Series to single-column DataFrames before saving
yseller_train = pd.DataFrame(yseller_train)
yseller_test = pd.DataFrame(yseller_test)

yseller_train.to_csv(os.path.join(path, 'yseller_train.csv'))
yseller_test.to_csv(os.path.join(path, 'yseller_test.csv'))

Buyer dataframe splitting

In [16]:
# Features: everything except the identifiers and the target
Xbuyer = dfScaled.drop(['identifierHash', 'countryCode', 'ScproductsBought'], axis=1)

# Target: standardized number of products bought
ybuyer = dfScaled['ScproductsBought']

Xbuyer_train, Xbuyer_test, ybuyer_train, ybuyer_test = train_test_split(Xbuyer, ybuyer, test_size=0.25, random_state=42)

# Save each split to the data directory (`path` is defined in cell 2)
Xbuyer_train.to_csv(os.path.join(path, 'Xbuyer_train.csv'))
Xbuyer_test.to_csv(os.path.join(path, 'Xbuyer_test.csv'))

# Convert the target Series to single-column DataFrames before saving
ybuyer_train = pd.DataFrame(ybuyer_train)
ybuyer_test = pd.DataFrame(ybuyer_test)

ybuyer_train.to_csv(os.path.join(path, 'ybuyer_train.csv'))
ybuyer_test.to_csv(os.path.join(path, 'ybuyer_test.csv'))

The original, unscaled processed dataframe is split in the same way

In [17]:
# Features: unscaled columns, minus the identifiers and the target
XPseller = dfProcessed.drop(['identifierHash', 'countryCode', 'productsSold'], axis=1)

# Target: raw number of products sold
yPseller = dfProcessed['productsSold']

XPseller_train, XPseller_test, yPseller_train, yPseller_test = train_test_split(XPseller, yPseller, test_size=0.25, random_state=42)

# Save each split to the data directory (`path` is defined in cell 2)
XPseller_train.to_csv(os.path.join(path, 'XPseller_train.csv'))
XPseller_test.to_csv(os.path.join(path, 'XPseller_test.csv'))

# Convert the target Series to single-column DataFrames before saving
yPseller_train = pd.DataFrame(yPseller_train)
yPseller_test = pd.DataFrame(yPseller_test)

yPseller_train.to_csv(os.path.join(path, 'yPseller_train.csv'))
yPseller_test.to_csv(os.path.join(path, 'yPseller_test.csv'))

In [18]:
# Features: unscaled columns, minus the identifiers and the target
XPbuyer = dfProcessed.drop(['identifierHash', 'countryCode', 'productsBought'], axis=1)

# Target: raw number of products bought
yPbuyer = dfProcessed['productsBought']

XPbuyer_train, XPbuyer_test, yPbuyer_train, yPbuyer_test = train_test_split(XPbuyer, yPbuyer, test_size=0.25, random_state=42)

# Save each split to the data directory (`path` is defined in cell 2)
XPbuyer_train.to_csv(os.path.join(path, 'XPbuyer_train.csv'))
XPbuyer_test.to_csv(os.path.join(path, 'XPbuyer_test.csv'))

# Convert the target Series to single-column DataFrames before saving
yPbuyer_train = pd.DataFrame(yPbuyer_train)
yPbuyer_test = pd.DataFrame(yPbuyer_test)

yPbuyer_train.to_csv(os.path.join(path, 'yPbuyer_train.csv'))
yPbuyer_test.to_csv(os.path.join(path, 'yPbuyer_test.csv'))
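
The four cells above repeat the same split-and-save pattern. A helper such as the hypothetical `split_and_save` below could express it once. The function name, the `out_dir` argument and the file-naming scheme are assumptions made for this sketch and are not part of the original notebook; the usage example runs on a tiny made-up frame and writes to a temporary directory.

```python
from pathlib import Path
import tempfile

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save(df, target, drop_cols, prefix, out_dir, test_size=0.25, random_state=42):
    """Split df into train/test sets and write four CSV files named after prefix."""
    X = df.drop(columns=list(drop_cols) + [target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    out_dir = Path(out_dir)
    parts = {"X": (X_train, X_test), "y": (y_train.to_frame(), y_test.to_frame())}
    for kind, (train, test) in parts.items():
        train.to_csv(out_dir / f"{kind}{prefix}_train.csv")
        test.to_csv(out_dir / f"{kind}{prefix}_test.csv")
    return X_train, X_test, y_train, y_test


# Usage on a tiny made-up frame, writing into a temporary directory
toy = pd.DataFrame({
    "identifierHash": range(8),
    "productsListed": [0, 1, 2, 0, 5, 3, 0, 1],
    "productsSold": [0, 1, 1, 0, 4, 2, 0, 0],
})
tmp_dir = tempfile.mkdtemp()
X_tr, X_te, y_tr, y_te = split_and_save(
    toy, "productsSold", ["identifierHash"], "seller", tmp_dir
)
print(sorted(p.name for p in Path(tmp_dir).iterdir()))
```

Like the cells above, the sketch writes the row index to each CSV. Reading such a file back with `pd.read_csv` produces an extra `Unnamed: 0` column like the one dropped in cell 4; passing `index=False` to `to_csv` would avoid it if the index is not needed.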