# Preprocessing for Mobile Category

The preprocessing for the other two categories, fashion and beauty, is identical to that for the mobile category. To avoid repeating the code, we show the steps for the mobile category only.

In [1]:
# Import the relevant libraries
import numpy as np
import pandas as pd
import os
import shutil
from tqdm import tqdm
import glob

In [3]:
# Read the csv file into train dataframe
train=pd.read_csv('ndsc_dataset/train.csv')

In [4]:
# Split image_path at the first '/' into the category folder and the file name
train[['Full Category', 'Filename']] = train['image_path'].str.split('/', n=1, expand=True)
train.drop('image_path', axis=1, inplace=True)
print("We are expecting", train.shape[0],"images in the entire dataset")

We are expecting 666615 images in the entire dataset


In [5]:
train['Full Category'].unique()

array(['beauty_image', 'fashion_image', 'mobile_image'], dtype=object)

A brief look shows that the data contains three categories.
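
To see how many rows fall into each category, and not only which categories exist, `value_counts()` can be used on the same column. The following is a minimal sketch on a toy dataframe; the rows are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the 'Full Category' column; values are illustrative only.
toy = pd.DataFrame({
    "Full Category": ["mobile_image", "beauty_image", "mobile_image", "fashion_image"],
})

# value_counts() returns the number of rows for each distinct value.
counts = toy["Full Category"].value_counts()
print(counts)
```

On the real data, the same call on `train['Full Category']` would give the per-category totals in one step.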

In [6]:
mobile=train[train['Full Category']=='mobile_image']
print("We are expecting", mobile.shape[0],"images in the mobile training dataset")

We are expecting 160330 images in the mobile training dataset


### In summary, the entire dataset contains 666615 images, of which 160330 belong to the mobile category.

In [7]:
mobile=mobile.sort_values(by=['Filename'], ascending = True)
mobile.head()

|  | itemid | title | Category | Full Category | Filename |
|---|---|---|---|---|---|
| 522051 | 1559229230 | jika minat silahkan wa 0831 4044 8453 promo di... | 31 | mobile_image | 00018defe03935e545929201b8eec50a.jpg |
| 596952 | 1022569612 | iphone 5s 32gb grey ibox tam ses trikomsel gar... | 31 | mobile_image | 0003ae5382360f19c88e83bc3d13e93d.jpg |
| 559422 | 1746034004 | promo big sale mito a19 4g 5 lte 2 16gb ready ... | 46 | mobile_image | 0004390752e0372ebe568d929d463837.jpg |
| 545052 | 1258509983 | flexible home tombol iphone 5s putih best seller | 31 | mobile_image | 000445556b7323a60a2e8a0a7525d77a.jpg |
| 554871 | 978529440 | microsoft surface pro 4 ci5 6300u 8gb 256gb ss... | 35 | mobile_image | 000477490f1e24c1aa9933b88fad14f4.jpg |


### Import mobile images from the dataset

Below we list the mobile images in the dataset provided by Shopee. The image folder contains both training and test images, so in the following steps we separate the training images from the test images.

In [8]:
# Change this path to the directory where your images are stored
os.chdir("/Users/matthewhan/Desktop/NDSC/mobile_image")
# Collect the names of all .jpg files in that directory
mobile_img = glob.glob("*.jpg")

print("There are a total of", len(mobile_img),"mobile images in the mobile dataset")

There are a total of 200747 mobile images in the mobile dataset
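
Listing the files does not strictly require changing the working directory. As a minimal sketch, the hypothetical helper `list_jpgs` below builds the glob pattern from a directory path and returns only the base names. It is demonstrated on a temporary directory with made-up file names. Note that without `os.chdir`, any later copy step would need the full source path and not just the file name.

```python
import glob
import os
import tempfile

def list_jpgs(image_dir):
    """Return the sorted base names of all .jpg files directly inside image_dir."""
    pattern = os.path.join(image_dir, "*.jpg")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.jpg", "a.jpg", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_jpgs(tmp)
    print(found)
```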


In [9]:
mobile_img_df=pd.DataFrame(mobile_img)
mobile_img_df.columns=['Filename']
mobile_img_df = mobile_img_df.sort_values(by=['Filename'], ascending = True)
mobile_img_df.head()

|  | Filename |
|---|---|
| 28474 | 0000456f97a4805ba4960084ffc8c058.jpg |
| 171220 | 00018defe03935e545929201b8eec50a.jpg |
| 93034 | 0003ae5382360f19c88e83bc3d13e93d.jpg |
| 124412 | 0003cb2b0c95619f009df611f77e0cf1.jpg |
| 48364 | 0004390752e0372ebe568d929d463837.jpg |


### Merging the dataframes

We right-join the training labels onto the list of image files. Files that match a row in the training data are training images; files with no match get NaN in the label columns and are the test images.

In [10]:
mobile_comb = pd.merge(mobile, mobile_img_df, on=['Filename'], how='right')
# Files with no training label have NaN in the label columns; drop them to keep the training images
train_mobile = mobile_comb.dropna()
print("Number of train images:", train_mobile.shape[0])

Number of train images: 160330
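
An alternative to relying on NaN values is the `indicator=True` option of `pd.merge`, which records for each row whether it matched. The following is a minimal sketch on toy dataframes; the names `labels`, `files`, `train_rows`, and `unlabelled_rows` and all values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins: 'labels' plays the role of the labelled training rows,
# 'files' the role of every image found on disk. All values are made up.
labels = pd.DataFrame({"Filename": ["a.jpg", "b.jpg"], "Category": [31, 46]})
files = pd.DataFrame({"Filename": ["a.jpg", "b.jpg", "c.jpg"]})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'right_only' for files that have no label.
merged = pd.merge(labels, files, on="Filename", how="right", indicator=True)

train_rows = merged[merged["_merge"] == "both"].drop(columns="_merge")
unlabelled_rows = merged.loc[merged["_merge"] == "right_only", ["Filename"]]
print(train_rows)
print(unlabelled_rows)
```

Compared with `dropna()`, this keeps the split explicit and would not discard a training row that happens to have a missing value in another column, such as an empty title.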


In [11]:
train_mobile.head()

|  | itemid | title | Category | Full Category | Filename |
|---|---|---|---|---|---|
| 0 | 1559229000.0 | jika minat silahkan wa 0831 4044 8453 promo di... | 31.0 | mobile_image | 00018defe03935e545929201b8eec50a.jpg |
| 1 | 1022570000.0 | iphone 5s 32gb grey ibox tam ses trikomsel gar... | 31.0 | mobile_image | 0003ae5382360f19c88e83bc3d13e93d.jpg |
| 2 | 1746034000.0 | promo big sale mito a19 4g 5 lte 2 16gb ready ... | 46.0 | mobile_image | 0004390752e0372ebe568d929d463837.jpg |
| 3 | 1258510000.0 | flexible home tombol iphone 5s putih best seller | 31.0 | mobile_image | 000445556b7323a60a2e8a0a7525d77a.jpg |
| 4 | 978529400.0 | microsoft surface pro 4 ci5 6300u 8gb 256gb ss... | 35.0 | mobile_image | 000477490f1e24c1aa9933b88fad14f4.jpg |


In [12]:
# The merge turned Category into float; convert it back to int.
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning.
train_mobile = train_mobile.assign(Category=train_mobile['Category'].astype('int'))
train_mobile.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 160330 entries, 0 to 160329
Data columns (total 5 columns):
itemid           160330 non-null float64
title            160330 non-null object
Category         160330 non-null int64
Full Category    160330 non-null object
Filename         160330 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 7.3+ MB


In [13]:
mobile_categories = train_mobile['Category'].unique()

### Extracting the images  
Here we loop over the categories and copy each training image into a destination folder named after its category.

In [14]:
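for cat in tqdm(mobile_categories):
    # File names of all training images in this category
    category_pics = train_mobile.loc[train_mobile.Category == cat, 'Filename'].tolist()
    # Destination folder for this category; create it if it does not exist yet
    temp = os.path.join('/Users/matthewhan/Desktop/NDSC/mobile_by_category', str(cat))
    os.makedirs(temp, exist_ok=True)
    for pic in category_pics:
        # Source paths are relative to the image directory set with os.chdir above
        shutil.copy(pic, temp)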

100%|██████████| 27/27 [1:30:50<00:00,  9.56s/it]   
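
One detail of the copy step is worth spelling out: if the destination passed to `shutil.copy` is not an existing directory, the file is written to a file with that name, so each category folder has to exist before copying. The following is a minimal sketch of the same copy-by-category idea that creates the folders on demand and returns per-category counts for a quick sanity check. The helper `copy_by_category` and the file names are hypothetical, and the demonstration runs on temporary directories, not the real dataset.

```python
import os
import shutil
import tempfile

def copy_by_category(file_to_category, src_dir, dst_root):
    """Copy each file from src_dir into dst_root/<category>/ and return per-category counts."""
    counts = {}
    for filename, category in file_to_category.items():
        target_dir = os.path.join(dst_root, str(category))
        os.makedirs(target_dir, exist_ok=True)  # create the folder on first use
        shutil.copy(os.path.join(src_dir, filename), target_dir)
        counts[category] = counts.get(category, 0) + 1
    return counts

# Demonstration with made-up, empty files in temporary directories.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    mapping = {"a.jpg": 31, "b.jpg": 31, "c.jpg": 46}
    for name in mapping:
        open(os.path.join(src, name), "w").close()
    counts = copy_by_category(mapping, src, dst)
    copied = {c: sorted(os.listdir(os.path.join(dst, str(c)))) for c in counts}
    print(counts, copied)
```

Comparing the returned counts against `train_mobile['Category'].value_counts()` would be one way to confirm that every training image reached its folder.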
