# Preprocessing and Cleaning Data

The Preprocessing file already covered the general preprocessing methods. This example goes a step further and shows how to handle categorical variables.

In [1]:
import pandas as pd
import numpy as np


# Load the dataset
data = pd.read_csv('shopping_trends.csv')

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type               0
Discount Applied            0
Promo Code Used             0
Previous Purchases          0
Preferred Payment Method    0
Frequency of Purchases      0
dtype: int64


Every column reports zero missing values, so nothing needs to be imputed or dropped. Next, let's look at the column types.

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               3900 non-null   int64  
 1   Age                       3900 non-null   int64  
 2   Gender                    3900 non-null   object 
 3   Item Purchased            3900 non-null   object 
 4   Category                  3900 non-null   object 
 5   Purchase Amount (USD)     3900 non-null   int64  
 6   Location                  3900 non-null   object 
 7   Size                      3900 non-null   object 
 8   Color                     3900 non-null   object 
 9   Season                    3900 non-null   object 
 10  Review Rating             3900 non-null   float64
 11  Subscription Status       3900 non-null   object 
 12  Payment Method            3900 non-null   object 
 13  Shipping Type             3900 non-null   object 
 14  Discount

In [4]:
data.shape

(3900, 19)
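
Before encoding anything, it can help to list which columns are categorical and how many distinct values each one holds. The sketch below is only an illustration: it uses a small hand-made frame (`toy`, a hypothetical stand-in for `data`) with a few of the column names from the dataset, and it treats every non-numeric column as categorical, which is an assumption that happens to hold for the columns shown above.

```python
import pandas as pd

# Hypothetical miniature of the shopping data; the real notebook uses `data`.
toy = pd.DataFrame({
    "Age": [55, 19, 50, 21],
    "Season": ["Winter", "Winter", "Spring", "Spring"],
    "Size": ["L", "L", "S", "M"],
    "Review Rating": [3.1, 3.1, 3.1, 3.5],
})

# Assumption: every non-numeric column is a categorical variable.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct categories per column (its cardinality).
cardinality = toy[categorical_cols].nunique()

print(categorical_cols)
print(cardinality)
```

Columns with only a handful of categories, such as Season, are the easiest candidates for one-hot encoding; columns with many categories produce correspondingly many encoded columns.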

In [9]:
data.head(200)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Credit Card,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Bank Transfer,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Cash,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,PayPal,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Cash,Free Shipping,Yes,Yes,31,PayPal,Annually
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,196,51,Male,Jacket,Outerwear,25,New York,M,Magenta,Fall,4.3,Yes,Credit Card,Free Shipping,Yes,Yes,34,Credit Card,Monthly
196,197,38,Male,Boots,Footwear,88,Washington,M,Lavender,Summer,3.9,Yes,Cash,Next Day Air,Yes,Yes,41,Credit Card,Fortnightly
197,198,59,Male,Scarf,Accessories,78,South Carolina,M,Black,Fall,3.2,Yes,Debit Card,2-Day Shipping,Yes,Yes,41,Credit Card,Monthly
198,199,57,Male,Jewelry,Accessories,45,Utah,M,Turquoise,Winter,4.8,Yes,Cash,Standard,Yes,Yes,39,Credit Card,Fortnightly


In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
lEncoder = LabelEncoder()

In [12]:
# Select the Season column by position (index 9 in the info listing above)
X = data.iloc[:, 9]
X

0       Winter
1       Winter
2       Spring
3       Spring
4       Spring
         ...  
3895    Summer
3896    Spring
3897    Spring
3898    Summer
3899    Spring
Name: Season, Length: 3900, dtype: object

In [26]:
# Replace each season name with an integer code (classes are sorted alphabetically)
X = lEncoder.fit_transform(X)
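
To see which integer stands for which season, the fitted encoder exposes its sorted categories in `classes_`. The following is a minimal sketch on a hand-made list (`seasons` and `season_encoder` are hypothetical names, not variables from the notebook), assuming the same four season labels that appear in the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of season labels, not the real column.
seasons = ["Winter", "Winter", "Spring", "Summer", "Fall"]

season_encoder = LabelEncoder()
codes = season_encoder.fit_transform(seasons)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(season_encoder.classes_)}

# inverse_transform turns the integer codes back into the original labels.
restored = season_encoder.inverse_transform(codes)

print(mapping)
print(list(codes))
```

The integer codes imply an ordering (Fall < Spring < Summer < Winter) that has no real meaning for seasons, which is one reason to one-hot encode the values afterwards.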

In [19]:
from sklearn.preprocessing import OneHotEncoder

In [35]:
# OneHotEncoder expects a 2D array, so turn the 1D codes into a single column
X_reshaped = X.reshape(-1, 1)

# Apply OneHotEncoder
onehotencoder = OneHotEncoder()
X_encoded = onehotencoder.fit_transform(X_reshaped).toarray()
X_encoded

array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.]])