# Mercari Data Capstone Project - Preprocessing and Training

Now that we've determined which variables we're interested in as predictors of our target, it's time to reshape our data into a form that will fit the models we're about to create. We will standardize the magnitude of any numeric features and create dummy features for our categorical variables.
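As a rough illustration of what standardizing a numeric feature means, here is a minimal sketch on made-up data. The column name `item_weight` and its values are hypothetical and do not come from the Mercari data set; `StandardScaler` is one common way to do this, not necessarily the approach used later in this project.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column name and values are made up for illustration.
toy = pd.DataFrame({"item_weight": [1.0, 2.0, 3.0, 4.0]})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["item_weight"]])

# After scaling, the column has mean 0 and unit (population) standard deviation.
print(scaled.mean(), scaled.std())
```

In a real workflow the scaler would usually be fit on the training rows only and then applied to the test rows, so that no information from the test set leaks into training.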

First let's get the usual beginning work out of the way: import statements, reading in the data set, and checking to see if it's in good shape.

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [2]:
# Reading in our data set
path = "/Users/jasonzhou/Documents/GuidedCapstone2"
os.chdir(path)

mercari_data = pd.read_csv("MercariDataCleaned.csv")
mercari_data = mercari_data.drop(columns=['Unnamed: 0'])

pd.set_option('display.float_format', lambda x: '%.5f' % x)

mercari_data.head()

,Condition,Category,Brand,Price,Shipping
0,5,Beauty/Makeup/Lips,Sephora,3.80735,1
1,5,Beauty/Makeup/Lips,Sephora,3.80735,0
2,3,Electronics/Cell Phones & Accessories/Cell Pho...,Apple,6.30378,0
3,5,"Women/Athletic Apparel/Pants, Tights, Leggings",PINK,5.39232,1
4,3,Women/Dresses/Knee-Length,Customized & Personalized,4.80735,0


In [3]:
mercari_data.shape

(567252, 5)

In [4]:
# Checking for missing values
missing = pd.concat([mercari_data.isnull().sum(), 100 * mercari_data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count', ascending=False)

,count,%
Condition,0,0.0
Category,0,0.0
Brand,0,0.0
Price,0,0.0
Shipping,0,0.0


Based on the results of our data analysis, we've already decided that the Condition column is no longer worth analyzing. Excluding it and our target column Price leaves only three columns for us to work on: Brand, Category, and Shipping.

In [5]:
# final dataset will be defined in df

df = mercari_data[['Category', 'Brand', 'Price', 'Shipping']]

Due to the constraints of our computing power, we have to cut the dataset down substantially so that training and tuning models later on runs within a reasonable time. Here we keep a random 20% sample of the rows.

In [6]:
df = df.sample(frac = 0.2)

In [7]:
df.shape

(113450, 4)
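The sampling call above does not set a seed, so it draws different rows on every run. If reproducibility matters, one option is to pass `random_state`. The sketch below uses a hypothetical 100-row frame and an arbitrary seed of 1 to show the effect.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
toy = pd.DataFrame({"Price": range(100)})

# The same seed yields the same rows on every run.
sample_a = toy.sample(frac=0.2, random_state=1)
sample_b = toy.sample(frac=0.2, random_state=1)

print(sample_a.shape)             # (20, 1)
print(sample_a.equals(sample_b))  # True
```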

When we determine our training and test data, we call train_test_split on X and y. Here y is simply the Price column, df['Price']. X will consist of our Brand, Category, and Shipping columns, with the categorical columns converted to dummy variables.

In [8]:
X = df[['Category', 'Brand', 'Shipping']]
y = df['Price']

In [9]:
X_enc = pd.get_dummies(X)
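To make it clearer what `pd.get_dummies` does to a frame like X, here is a small sketch on a hypothetical three-row frame that reuses our column names. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row frame mirroring the column names used above.
toy = pd.DataFrame({
    "Category": ["Beauty", "Women", "Beauty"],
    "Brand": ["Sephora", "PINK", "Apple"],
    "Shipping": [1, 0, 0],
})

toy_enc = pd.get_dummies(toy)

# Numeric columns pass through unchanged; each distinct string value
# becomes its own indicator column named "<column>_<value>".
print(list(toy_enc.columns))
print(toy_enc.shape)  # (3, 6)
```

Two categories plus three brands plus the untouched Shipping column give six columns. The same mechanism explains why the real data set ends up with so many columns once every distinct Category and Brand gets its own indicator.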

In [10]:
# Trimming the data down to a more workable size:
# test_size=0.4 keeps 60% of the rows in X_trimmed / y_trimmed and discards the rest

X_trimmed, _, y_trimmed, _ = train_test_split(X_enc, y, test_size=0.4, random_state=1)

In [11]:
X_trimmed.shape

(68070, 1092)

In [12]:
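# Splitting the data into train (70%) and test (30%) sets
# Note: this splits the full encoded sample (X_enc, y), not X_trimmed / y_trimmed from above

X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size=0.3, random_state=1)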
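As a quick sanity check on how `train_test_split` behaves, the sketch below runs it on hypothetical toy arrays. It shows that a `test_size` of 0.3 sends 30% of the rows to the test set, and that each feature row stays paired with its target value after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: row i of X_toy is [2*i, 2*i + 1], and y_toy[i] is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Rows are shuffled, but features and targets are shuffled together.
print((X_tr[:, 0] == 2 * y_tr).all())  # True
```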

In [13]:
# Write out all four sets of training/testing data to CSV files so we don't have to repeat this work during the next phase

X_train.to_csv('MercariX_train.csv')
y_train.to_csv('Mercariy_train.csv')
X_test.to_csv('MercariX_test.csv')
y_test.to_csv('Mercariy_test.csv')
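Because `to_csv` writes the row index as an unnamed first column by default, files saved this way come back with an extra 'Unnamed: 0' column unless the index is handled on load. The sketch below shows one way to deal with that when reading the files back in, using a hypothetical two-row frame and an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Hypothetical small frame standing in for X_train.
toy = pd.DataFrame({"Shipping": [1, 0]}, index=[42, 7])

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores that column as the index instead of
# loading it as a data column named "Unnamed: 0".
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```

An alternative is to pass `index=False` to `to_csv` when the original row labels are not needed in the next phase.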