# Mercari Data Capstone Project - Preprocessing and Training

Now that we've determined which variables we're interested in as predictors of our target, it's time to reshape our data into a form that fits the models we're about to create. We will standardize the magnitude of our numeric features and create dummy features for our categorical variables.

First let's get the usual beginning work out of the way: import statements, reading in the data set, and checking to see if it's in good shape.

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [2]:
# Reading in our data set
path = "/Users/jasonzhou/Documents/GuidedCapstone2"
os.chdir(path)

mercari_data = pd.read_csv("MercariDataCleaned.csv")
mercari_data = mercari_data.drop(columns=['Unnamed: 0'])

pd.set_option('display.float_format', lambda x: '%.5f' % x)

mercari_data.head()

Unnamed: 0,Condition,Category,Brand,Price,Shipping
0,5,Women/Tops & Blouses/Blouse,Unbranded,21,0
1,5,"Electronics/TV, Audio & Surveillance/Headphones",Unbranded,10,1
2,5,Beauty/Makeup/Lips,Sephora,14,1
3,5,Beauty/Makeup/Lips,Sephora,14,0
4,4,Beauty/Makeup/Makeup Sets,Unbranded,5,1


In [3]:
mercari_data.shape

(1048575, 5)

In [4]:
# Checking for missing values
missing = pd.concat([mercari_data.isnull().sum(), 100 * mercari_data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count', ascending=False)

Unnamed: 0,count,%
Condition,0,0.0
Category,0,0.0
Brand,0,0.0
Price,0,0.0
Shipping,0,0.0


Based on the results of our data analysis, we've already decided that the Condition column is no longer worth analyzing. Excluding it and our target column, Price, leaves only 3 columns for us to work on: Brand, Category, and Shipping.

In [5]:
# final dataset will be defined in df

df = mercari_data[['Category', 'Brand', 'Price', 'Shipping']]

When we determine our training and test data, we call train_test_split on X and y. The choice of y is straightforward: it is simply the Price column, df['Price']. X is composed of our Brand, Category, and Shipping columns. Brand and Category are converted to dummy variables, while Shipping is already a 0/1 indicator, so pd.get_dummies leaves it as a single numeric column.

In [6]:
# Forming our final X

X = pd.get_dummies(df[['Brand', 'Category', 'Shipping']])

# verifying dimensions. should have 4384 + 1259 + 1 = 5644 columns
# (Shipping is already numeric 0/1, so get_dummies keeps it as a single column)

X.shape

(1048575, 5644)
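
To make the column count above easier to reason about, here is a minimal sketch on a toy frame. The values in `toy` are hypothetical stand-ins, not Mercari data. It illustrates that `pd.get_dummies` expands the text columns into one column per distinct value but passes an already-numeric column such as Shipping through unchanged, so the expected width is the number of distinct brands plus the number of distinct categories plus one.

```python
import pandas as pd

# Toy stand-in for the Mercari frame (hypothetical values, not real data)
toy = pd.DataFrame({
    "Brand": ["Sephora", "Unbranded", "Sephora", "Nike"],
    "Category": ["Beauty", "Women", "Beauty", "Men"],
    "Shipping": [1, 0, 0, 1],
})

toy_X = pd.get_dummies(toy[["Brand", "Category", "Shipping"]])

# Expected width: one column per distinct Brand, one per distinct Category,
# plus the single numeric Shipping column, which get_dummies leaves untouched.
expected_cols = toy["Brand"].nunique() + toy["Category"].nunique() + 1

print(toy_X.shape[1], expected_cols)
print("Shipping" in toy_X.columns)
```

The same check could be run on the real data by comparing `X.shape[1]` against `df['Brand'].nunique() + df['Category'].nunique() + 1`.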

In [7]:
# Splitting the data into train and test
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
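
As a quick sanity check on what the split does, the sketch below runs `train_test_split` on a small toy data set (the names `toy_X` and `toy_y` are illustrative, not part of the project). It shows that `test_size=.2` holds out 20% of the rows, that features and target stay aligned by index, and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (hypothetical values, 10 rows)
toy_X = pd.DataFrame({"feature": np.arange(10)})
toy_y = pd.Series(np.arange(10) * 2, name="Price")

# Two splits with the same random_state should be identical
first = train_test_split(toy_X, toy_y, test_size=.2, random_state=1)
second = train_test_split(toy_X, toy_y, test_size=.2, random_state=1)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = first

print(toy_X_train.shape, toy_X_test.shape)            # 80% / 20% of the rows
print(toy_X_train.index.equals(toy_y_train.index))    # X and y stay aligned
print(toy_X_test.index.equals(second[1].index))       # same seed, same split
```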

In [None]:
# Scaling features
# Fit the scaler on the training data only, then apply the same transform to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
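
The scaling pattern above can be checked on a small synthetic example. The sketch below uses randomly generated arrays (`toy_train` and `toy_test` are assumptions for illustration, not project data) and confirms that, after fitting on the training set, each scaled training column has a mean of roughly 0 and a standard deviation of roughly 1. The test set is transformed with the training statistics, so its columns will generally be close to, but not exactly, those values.

```python
import numpy as np
from sklearn import preprocessing

# Synthetic numeric features (hypothetical data for illustration)
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50, scale=10, size=(100, 3))
toy_test = rng.normal(loc=50, scale=10, size=(20, 3))

# Fit on the training data only, then reuse the same statistics for the test data
toy_scaler = preprocessing.StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Scaled training columns should be centered with unit variance
print(np.allclose(toy_train_scaled.mean(axis=0), 0))
print(np.allclose(toy_train_scaled.std(axis=0), 1))
print(toy_test_scaled.shape)
```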