# Customer-to-Customer (C2C) Business Analytics


This project analyzes a fashion customer-to-customer (C2C) e-commerce platform on which
users sell products to other users. A seller's performance depends not only on the platform's
user interface but also on the type of products the seller uploads, the image quality and
description of each product, customer service, and the seller's social engagement (among other factors).
The analysis aims to answer the following questions from the available data:
• Which factors help sellers generate good sales on the platform?
• What is the typical lifetime value of a customer on the platform?
• What is the average retention rate of buyers on the platform?
• Given that the platform is based in France, how likely are users from other countries to
sign up?
• How active are users on the platform in general?

## Data Wrangling
The data wrangling stage involves the following steps:

1. Data Collection
    * Locating the data
    * Data loading
    * Data joining
2. Data Organization
    * File structure
    * Git & GitHub
3. Data Definition
    * Column names
    * Data types (numeric, categorical, timestamp, etc.)
    * Description of the columns
    * Count or percent per unique values or codes (including NA)
    * The range of values or codes
4. Data Cleaning
    * NA or missing data
    * Duplicates

### Importing Packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from langdetect import detect, DetectorFactory
from textblob import TextBlob
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
os.getcwd()

'/Users/oluwafemibabatunde'

In [3]:
path = '/Users/oluwafemibabatunde/Desktop/Springboard/C2C Business France'
os.chdir(path)
df = pd.read_csv('6M-0K-99K.users.dataset.public.csv')

In [4]:
df.head()

Unnamed: 0,identifierHash,type,country,language,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,...,civilityTitle,hasAnyApp,hasAndroidApp,hasIosApp,hasProfilePicture,daysSinceLastLogin,seniority,seniorityAsMonths,seniorityAsYears,countryCode
0,-1097895247965112460,user,Royaume-Uni,en,147,10,77,26,174,74.0,...,mr,True,False,True,True,11,3196,106.53,8.88,gb
1,2347567364561867620,user,Monaco,en,167,8,2,19,170,99.0,...,mrs,True,False,True,True,12,3204,106.8,8.9,mc
2,6870940546848049750,user,France,fr,137,13,60,33,163,94.0,...,mrs,True,False,True,False,11,3203,106.77,8.9,fr
3,-4640272621319568052,user,Etats-Unis,en,131,10,14,122,152,92.0,...,mrs,True,False,True,False,12,3198,106.6,8.88,us
4,-5175830994878542658,user,Etats-Unis,en,167,8,0,25,125,100.0,...,mrs,False,False,False,True,22,2854,95.13,7.93,us


Creating file structure for data organization

In [5]:
datapath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_three/C2C Business Analytics/data"
if not os.path.isdir(datapath):
   os.makedirs(datapath)

In [6]:
figpath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_three/C2C Business Analytics/figures"
if not os.path.isdir(figpath):
   os.makedirs(figpath)

In [7]:
modelpath = "/Users/oluwafemibabatunde/Desktop/Springboard/capstone_three/C2C Business Analytics/model"
if not os.path.isdir(modelpath):
   os.makedirs(modelpath)
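
The three cells above repeat the same check-then-create pattern. A minimal sketch of a more compact version follows, assuming the same three folder names. The `base` directory here is a temporary stand-in (an assumption for illustration), not the project path used in the notebook.

```python
import os
import tempfile

# Stand-in for the project folder (assumption for illustration only);
# in the notebook this would be the capstone directory used above.
base = tempfile.mkdtemp()

paths = {}
for name in ("data", "figures", "model"):
    paths[name] = os.path.join(base, name)
    # exist_ok=True makes the call a no-op when the folder already exists,
    # so a separate os.path.isdir check is not needed.
    os.makedirs(paths[name], exist_ok=True)

datapath, figpath, modelpath = paths["data"], paths["figures"], paths["model"]
```

Running the loop a second time leaves the folders untouched, which makes the cell safe to re-execute.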

Data Definition 

Column Names 

In [8]:
df.columns

Index(['identifierHash', 'type', 'country', 'language', 'socialNbFollowers',
       'socialNbFollows', 'socialProductsLiked', 'productsListed',
       'productsSold', 'productsPassRate', 'productsWished', 'productsBought',
       'gender', 'civilityGenderId', 'civilityTitle', 'hasAnyApp',
       'hasAndroidApp', 'hasIosApp', 'hasProfilePicture', 'daysSinceLastLogin',
       'seniority', 'seniorityAsMonths', 'seniorityAsYears', 'countryCode'],
      dtype='object')

Dataframe shape

In [9]:
df.shape

(98913, 24)

Field Descriptions

In [10]:
# Map each column name to a short description (keys must match df.columns exactly).
field_description = {
    'identifierHash': 'Hash of User ID',
    'type': 'The entity type',
    'country': "User's Country (written in French)",
    'language': "The User's Preferred language",
    'socialNbFollowers': "Number of users who subscribed to this user's activity. New accounts are automatically followed by the store's official",
    'socialNbFollows': 'Number of user account this user follows. New accounts are automatically assigned to follow the official partners',
    'socialProductsLiked': 'Number of products this user liked',
    'productsListed': 'Number of currently unsold products that this user has uploaded.',
    'productsSold': 'Number of products this user has sold',
    'productsPassRate': "% of products meeting the product description. (Sold products are reviewed by the store's team before being shipped to the",
    'productsWished': 'Number of products this user added to his/her wishlist.',
    'productsBought': 'Number of products this user bought',
    'gender': "user's gender",
    'civilityGenderId': 'civility as integer',
    'civilityTitle': 'Civility Title',
    'hasAnyApp': "user has ever used any of the store's official app",
    'hasAndroidApp': 'user has ever used the official Android app',
    'hasIosApp': 'user has ever used the official iOS app',
    'hasProfilePicture': 'user has a custom profile picture',
    'daysSinceLastLogin': 'Number of days since the last login',
    'seniority': 'Number of days since the user registered',
    'seniorityAsMonths': 'See seniority in months',
    'seniorityAsYears': 'See seniority in years',
    'countryCode': "user's country (ISO-3166-1)",
}
# One row per field, with the field name as the index.
fields = pd.DataFrame.from_dict(field_description, orient='index')
fields.index.name = 'Fields'
fields = fields.rename(columns={0: 'Description'})
fields

Fields,Description
identifierHash,Hash of User ID
type,The entity type
country,User's Country (written in French)
language,The User's Preferred language
socialNbFollowers,Number of users who subscribed to this user's ...
socialNbFollows,Number of user account this user follows. New ...
socialProductsLiked,Number of products this user liked
productsListed,Number of currently unsold products that this ...
productsSold,Number of products this user has sold
productsPassRate,% of products meeting the product description....


Check the data types and other info of the dataframe fields

In [11]:
df.dtypes

identifierHash           int64
type                    object
country                 object
language                object
socialNbFollowers        int64
socialNbFollows          int64
socialProductsLiked      int64
productsListed           int64
productsSold             int64
productsPassRate       float64
productsWished           int64
productsBought           int64
gender                  object
civilityGenderId         int64
civilityTitle           object
hasAnyApp                 bool
hasAndroidApp             bool
hasIosApp                 bool
hasProfilePicture         bool
daysSinceLastLogin       int64
seniority                int64
seniorityAsMonths      float64
seniorityAsYears       float64
countryCode             object
dtype: object

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98913 entries, 0 to 98912
Data columns (total 24 columns):
identifierHash         98913 non-null int64
type                   98913 non-null object
country                98913 non-null object
language               98913 non-null object
socialNbFollowers      98913 non-null int64
socialNbFollows        98913 non-null int64
socialProductsLiked    98913 non-null int64
productsListed         98913 non-null int64
productsSold           98913 non-null int64
productsPassRate       98913 non-null float64
productsWished         98913 non-null int64
productsBought         98913 non-null int64
gender                 98913 non-null object
civilityGenderId       98913 non-null int64
civilityTitle          98913 non-null object
hasAnyApp              98913 non-null bool
hasAndroidApp          98913 non-null bool
hasIosApp              98913 non-null bool
hasProfilePicture      98913 non-null bool
daysSinceLastLogin     98913 non-null int64
seniorit

Number of unique entries in each field

In [13]:
df.nunique()

identifierHash         98913
type                       1
country                  200
language                   5
socialNbFollowers         90
socialNbFollows           85
socialProductsLiked      420
productsListed            65
productsSold              75
productsPassRate          72
productsWished           279
productsBought            70
gender                     2
civilityGenderId           3
civilityTitle              3
hasAnyApp                  2
hasAndroidApp              2
hasIosApp                  2
hasProfilePicture          2
daysSinceLastLogin       699
seniority                 19
seniorityAsMonths         19
seniorityAsYears           6
countryCode              199
dtype: int64

Unique values per field as a percentage of all cells in the dataframe (df.size = rows × columns)

In [14]:
df1 = df.nunique()
dfSize = df.size
percentage_df1 = (df1/dfSize)*100
print(percentage_df1)

identifierHash         4.166667
type                   0.000042
country                0.008425
language               0.000211
socialNbFollowers      0.003791
socialNbFollows        0.003581
socialProductsLiked    0.017692
productsListed         0.002738
productsSold           0.003159
productsPassRate       0.003033
productsWished         0.011753
productsBought         0.002949
gender                 0.000084
civilityGenderId       0.000126
civilityTitle          0.000126
hasAnyApp              0.000084
hasAndroidApp          0.000084
hasIosApp              0.000084
hasProfilePicture      0.000084
daysSinceLastLogin     0.029445
seniority              0.000800
seniorityAsMonths      0.000800
seniorityAsYears       0.000253
countryCode            0.008383
dtype: float64
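
Because `df.size` counts every cell (rows × columns), the percentages above are relative to the whole table rather than to the number of users. A minimal sketch of a per-row alternative follows, together with the per-code shares (including NA) listed under Data Definition. The `toy` frame is a made-up stand-in for `df`, and its values are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "identifierHash": [11, 22, 33, 44],
    "type": ["user", "user", "user", "user"],
    "language": ["en", "fr", "en", None],
})

# Distinct values per column as a percentage of the number of rows.
pct_unique = toy.nunique() / len(toy) * 100

# Share of each code in one column, keeping NA as its own category.
language_share = toy["language"].value_counts(normalize=True, dropna=False) * 100
```

With this definition a column such as `identifierHash`, which is unique per user, reads as 100%.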


Check for duplicate rows in the dataframe

In [15]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,identifierHash,type,country,language,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,...,civilityTitle,hasAnyApp,hasAndroidApp,hasIosApp,hasProfilePicture,daysSinceLastLogin,seniority,seniorityAsMonths,seniorityAsYears,countryCode
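
The empty result above means no row is an exact copy of another. As a hedged sketch, the same check can be reduced to a count, and repeated on the user identifier alone, which would also catch a user appearing twice with different values. The `toy` frame is a made-up stand-in for `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "identifierHash": [11, 22, 22, 33],
    "productsSold": [0, 5, 5, 1],
})

# Rows that repeat an earlier row in every column.
n_duplicate_rows = int(toy.duplicated().sum())

# Rows that repeat an earlier user identifier, whatever the other columns hold.
n_duplicate_ids = int(toy.duplicated(subset="identifierHash").sum())

# Keep the first occurrence of each fully duplicated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```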


Check the number of missing values in each field

In [16]:
df.isnull().sum()

identifierHash         0
type                   0
country                0
language               0
socialNbFollowers      0
socialNbFollows        0
socialProductsLiked    0
productsListed         0
productsSold           0
productsPassRate       0
productsWished         0
productsBought         0
gender                 0
civilityGenderId       0
civilityTitle          0
hasAnyApp              0
hasAndroidApp          0
hasIosApp              0
hasProfilePicture      0
daysSinceLastLogin     0
seniority              0
seniorityAsMonths      0
seniorityAsYears       0
countryCode            0
dtype: int64

In [17]:
df.isna().sum()

identifierHash         0
type                   0
country                0
language               0
socialNbFollowers      0
socialNbFollows        0
socialProductsLiked    0
productsListed         0
productsSold           0
productsPassRate       0
productsWished         0
productsBought         0
gender                 0
civilityGenderId       0
civilityTitle          0
hasAnyApp              0
hasAndroidApp          0
hasIosApp              0
hasProfilePicture      0
daysSinceLastLogin     0
seniority              0
seniorityAsMonths      0
seniorityAsYears       0
countryCode            0
dtype: int64
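
`isnull` is an alias of `isna` in pandas, so the two cells above return the same counts. A minimal sketch of a single summary that reports both the count and the percentage of missing values per field follows. The `toy` frame is a made-up stand-in for `df`, which has no missing values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "country": ["France", None, "Italie", "France"],
    "productsPassRate": [74.0, np.nan, np.nan, 100.0],
})

# isna() gives a Boolean mask; its sum is the count and its mean the fraction.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
```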

In [18]:
df.country.value_counts()

France                                    25135
Etats-Unis                                20602
Royaume-Uni                               11310
Italie                                     8015
Allemagne                                  6567
                                          ...  
Mayotte                                       1
Swaziland                                     1
Iles mineures éloignées des États-Unis        1
Sri Lanka                                     1
Saint Vincent et les Grenadines               1
Name: country, Length: 200, dtype: int64

In [19]:
df.gender.value_counts()

F    76121
M    22792
Name: gender, dtype: int64

In [20]:
!pip install translate
from translate import Translator



In [21]:
translator = Translator(to_lang="English", from_lang = "French")
translation1 = translator.translate("Royaume-Uni")
translation1
# Translating the whole column with this translator was too slow: a run started at 10:33 am
# was still going at 5:29 pm.
# The French country names are therefore kept for one-hot encoding and used in the model.

'United Kingdom'

In [22]:
translation2 = translator.translate("Etats-Unis")
translation2

'United States'
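
The country column holds only 200 distinct names across 98,913 rows, so one possible way to cut the translation time is to translate each distinct name once and map the result back onto the column. The sketch below illustrates that idea under stated assumptions: `translate_country` is a hypothetical stand-in for `translator.translate`, backed by a small lookup so the example runs offline, and `toy` is a made-up stand-in for `df`.

```python
import pandas as pd


def translate_country(name: str) -> str:
    """Hypothetical stand-in for translator.translate(name)."""
    lookup = {
        "Royaume-Uni": "United Kingdom",
        "Etats-Unis": "United States",
        "France": "France",
    }
    # Fall back to the original name when no translation is known.
    return lookup.get(name, name)


# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({"country": ["Royaume-Uni", "Etats-Unis", "France", "Etats-Unis"]})

# Translate each distinct name once ...
mapping = {name: translate_country(name) for name in toy["country"].unique()}

# ... then broadcast the translations to every row with a dictionary lookup.
toy["country_en"] = toy["country"].map(mapping)
```

The number of translation calls then scales with the number of distinct countries rather than with the number of rows.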

In [23]:
df = pd.concat([df,pd.get_dummies(df['country'], prefix='country')],axis=1)

In [24]:
df = pd.concat([df,pd.get_dummies(df['gender'], prefix='gender')],axis=1)

In [25]:
df = pd.concat([df,pd.get_dummies(df['civilityTitle'], prefix='civilityTitle')],axis=1)

In [26]:
df = pd.concat([df,pd.get_dummies(df['language'], prefix='lang')],axis=1)
df.head()

Unnamed: 0,identifierHash,type,country,language,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,...,gender_F,gender_M,civilityTitle_miss,civilityTitle_mr,civilityTitle_mrs,lang_de,lang_en,lang_es,lang_fr,lang_it
0,-1097895247965112460,user,Royaume-Uni,en,147,10,77,26,174,74.0,...,0,1,0,1,0,0,1,0,0,0
1,2347567364561867620,user,Monaco,en,167,8,2,19,170,99.0,...,1,0,0,0,1,0,1,0,0,0
2,6870940546848049750,user,France,fr,137,13,60,33,163,94.0,...,1,0,0,0,1,0,0,0,1,0
3,-4640272621319568052,user,Etats-Unis,en,131,10,14,122,152,92.0,...,1,0,0,0,1,0,1,0,0,0
4,-5175830994878542658,user,Etats-Unis,en,167,8,0,25,125,100.0,...,1,0,0,0,1,0,1,0,0,0
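
Cells 23 to 26 repeat the same `pd.concat` and `pd.get_dummies` pattern for each categorical field. A minimal sketch of a single-call alternative follows, using the `columns` and `prefix` parameters of `pd.get_dummies`. The `toy` frame is a made-up stand-in for `df`. Note that, unlike the cells above, this form removes the original categorical columns in the same step.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "language": ["en", "en", "fr"],
    "productsSold": [174, 170, 163],
})

# Encode several categorical columns at once; the originals are replaced
# by their indicator columns, and dtype=int yields 0/1 rather than True/False.
encoded = pd.get_dummies(
    toy,
    columns=["gender", "language"],
    prefix={"gender": "gender", "language": "lang"},
    dtype=int,
)
```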


Convert the Boolean fields to integer type for analysis.

In [27]:
df['hasAnyApp'] = df['hasAnyApp'].astype(int)
df['hasAndroidApp'] = df['hasAndroidApp'].astype(int)
df['hasIosApp'] = df['hasIosApp'].astype(int)
df['hasProfilePicture'] = df['hasProfilePicture'].astype(int)
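
As a hedged alternative to listing each Boolean field by hand, the Boolean columns can be selected by dtype and converted in one call. The `toy` frame is a made-up stand-in for `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "hasAnyApp": [True, False],
    "hasIosApp": [True, True],
    "productsSold": [3, 0],
})

# Find every Boolean column, then cast them all to integers in one astype call.
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({col: int for col in bool_cols})
```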

The original country, gender, civilityTitle and language columns are dropped now that they have been one-hot encoded.

In [28]:
df = df.drop(['country', 'gender', 'civilityTitle', 'language'], axis = 1)

The type column is dropped because it holds a single unique value and therefore carries no information for the model.

In [29]:
df = df.drop(['type'], axis = 1)
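
The single-valued `type` field was identified by reading the `nunique()` output. A minimal sketch of finding such constant columns programmatically follows. The `toy` frame is a made-up stand-in for `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "type": ["user", "user", "user"],
    "productsSold": [0, 5, 1],
    "hasAnyApp": [1, 0, 1],
})

# A column with at most one distinct value (NA counted as a value) is constant.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]

# Drop the constant columns; they cannot help a model distinguish users.
reduced = toy.drop(columns=constant_cols)
```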

In [30]:
df.describe()

Unnamed: 0,identifierHash,socialNbFollowers,socialNbFollows,socialProductsLiked,productsListed,productsSold,productsPassRate,productsWished,productsBought,civilityGenderId,...,gender_F,gender_M,civilityTitle_miss,civilityTitle_mr,civilityTitle_mrs,lang_de,lang_en,lang_es,lang_fr,lang_it
count,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,...,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0,98913.0
mean,-6692039000000000.0,3.432269,8.425677,4.420743,0.093304,0.121592,0.812303,1.562595,0.171929,1.773993,...,0.769575,0.230425,0.004418,0.230425,0.765157,0.072569,0.521307,0.060993,0.266618,0.078513
std,5.330807e+18,3.882383,52.839572,181.030569,2.050144,2.126895,8.500205,25.192793,2.332266,0.428679,...,0.421107,0.421107,0.066322,0.421107,0.423903,0.259429,0.499548,0.239319,0.442193,0.268979
min,-9.223101e+18,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-4.622895e+18,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,-1337989000000000.0,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
75%,4.616388e+18,3.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
max,9.223331e+18,744.0,13764.0,51671.0,244.0,174.0,100.0,2635.0,405.0,3.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
df.to_csv(r'/Users/oluwafemibabatunde/Desktop/Springboard/capstone_three/C2C Business Analytics/data/step1DF_output.csv')

This dataset was already cleaned at the source, so some wrangling steps were unnecessary. Next, the dataset will be explored visually to gain more insight into the relationships within it. Segmentation will also be carried out to determine what proportion of users can be categorized as sellers and as buyers.
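
One possible starting point for the seller and buyer segmentation is sketched below. The rules are assumptions made for illustration, not definitions taken from the dataset: a user counts as a seller if they have listed or sold at least one product, and as a buyer if they have bought at least one. The `toy` frame is a made-up stand-in for `df`.

```python
import pandas as pd

# Toy frame standing in for df (values are illustrative only).
toy = pd.DataFrame({
    "productsListed": [2, 0, 0, 1],
    "productsSold": [5, 0, 0, 0],
    "productsBought": [0, 3, 0, 1],
})

# Assumed segment rules (hypothetical, for illustration).
is_seller = (toy["productsSold"] > 0) | (toy["productsListed"] > 0)
is_buyer = toy["productsBought"] > 0

# Percentage of users in each mutually exclusive segment.
segment_share = pd.Series({
    "seller_only": (is_seller & ~is_buyer).mean(),
    "buyer_only": (~is_seller & is_buyer).mean(),
    "both": (is_seller & is_buyer).mean(),
    "inactive": (~is_seller & ~is_buyer).mean(),
}) * 100
```

The four segments are mutually exclusive, so their shares add up to 100%.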