In [5]:
import pandas as pd
from pathlib import Path

In [8]:
file_path = Path("../Resources/shopping_data.csv")
df_shopping = pd.read_csv(file_path, encoding = "ISO-8859-1")
df_shopping.head()

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


## Questions for Data Preparation
Unlike supervised learning, unsupervised learning has no target variable; it is used to find patterns in the data. Preparing the data properly lets us select features that help reveal those patterns or groups.

Before we begin, consider these questions:

    What knowledge do we hope to glean from running an unsupervised learning model on this dataset?
    What data is available? What type? What is missing? What can be removed?
    Is the data in a format that can be passed into an unsupervised learning model?
    Can I quickly hand off this data for others to use?

Let's address the first question on our list:

What knowledge do we hope to glean from running an unsupervised learning model on this dataset?

It's a shopping dataset, so we can group shoppers by their spending habits.

## 18.2.4 What data is available?

In [9]:
# Output the columns
df_shopping.columns

Index(['CustomerID', 'Card Member', 'Age', 'Annual Income',
       'Spending Score (1-100)'],
      dtype='object')

Now that we know what data we have, we can start thinking about possible analyses. For example, features like Age and Annual Income might appear in our end result as groupings or clusters. However, there are no data points for items purchased, so our algorithms cannot discover patterns related to purchases.

## What type of data is available?

Using the dtypes attribute, confirm the data type of each column. This also alerts us if anything should be changed in the next step (e.g., converting text to numerical data). All the columns we plan to use in our model must contain a numerical data type:

In [10]:
# List dataframe data types
df_shopping.dtypes

CustomerID                  int64
Card Member                object
Age                       float64
Annual Income               int64
Spending Score (1-100)    float64
dtype: object
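
As a complement to reading the dtypes output by eye, the check can be sketched in code. This is a minimal example, assuming a small stand-in DataFrame named `df_sample` (a hypothetical name, built from the first three rows shown earlier) rather than the real file:

```python
import pandas as pd

# Stand-in for df_shopping, using the first three rows shown above
df_sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Card Member": ["Yes", "Yes", "No"],
    "Age": [19.0, 21.0, 20.0],
    "Annual Income": [15000, 15000, 16000],
    "Spending Score (1-100)": [39.0, 81.0, 6.0],
})

# Columns that are not numeric must be converted or dropped before modeling
non_numeric_cols = df_sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Here only Card Member is flagged, which matches the object dtype in the output above.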

## What data is missing?

Most unsupervised learning models can't handle missing data. If you try to run such a model on a dataset with missing values, you'll get an error such as the one below:

**ValueError: Input contains NaN, infinity or a value too large for dtype('float64').**

    - There is no set cutoff for missing data. That decision is left up to you, the analyst, and must be made based on your understanding of the business needs.

Pandas has the isnull() method to check for missing values. We'll loop through each column, check if there are null values, sum them up, and print out a readable total:

In [18]:
# Find null values
for column in df_shopping.columns:
    null_count = df_shopping[column].isnull().sum()
    print(f"Column {column} has {null_count} null values")

Column CustomerID has 0 null values
Column Card Member has 2 null values
Column Age has 2 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 1 null values


There are a few rows with missing values that we'll need to handle. The judgment call is whether to remove these rows or to decide that the dataset is not suitable for our model. In this case, we'll remove them because they are a small percentage of the overall data.
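
To put a number on "a small percentage," one option is to measure the share of missing values before dropping anything. The sketch below assumes a tiny illustrative DataFrame named `df_sample` (hypothetical, with nulls placed by hand), not the real shopping data:

```python
import numpy as np
import pandas as pd

# Illustrative data only: two of the four rows contain a null
df_sample = pd.DataFrame({
    "Card Member": ["Yes", None, "No", "No"],
    "Age": [19.0, 21.0, np.nan, 23.0],
    "Annual Income": [15000, 15000, 16000, 16000],
})

# Share of missing values per column (the mean of a boolean mask)
missing_share = df_sample.isnull().mean()

# Share of rows that dropna() would remove (rows with at least one null)
rows_lost = df_sample.isnull().any(axis=1).mean()

print(missing_share)
print(f"Rows with at least one null: {rows_lost:.0%}")
```

Whether the resulting share is acceptable is still a judgment call that depends on the business needs.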

In [20]:
# Drop rows with null values
df_shopping = df_shopping.dropna()

In [22]:
# Check for duplicated rows

print(f"Duplicated entries {df_shopping.duplicated().sum()}")

Duplicated entries 0
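
Our dataset has no duplicated rows, so nothing needs to be removed. If the count were nonzero, one possible follow-up is drop_duplicates(). A minimal sketch, assuming an illustrative DataFrame named `df_sample` (hypothetical) that contains one repeated row:

```python
import pandas as pd

# Illustrative data only: the third row repeats the first
df_sample = pd.DataFrame({
    "Age": [19.0, 21.0, 19.0],
    "Annual Income": [15000, 15000, 15000],
})

duplicate_count = df_sample.duplicated().sum()
print(f"Duplicated entries {duplicate_count}")

# Keep the first occurrence of each row and renumber the index
df_deduped = df_sample.drop_duplicates().reset_index(drop=True)
print(df_deduped)
```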


In [23]:
df_shopping.head()

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


In [24]:
# Drop the CustomerID column, which is an identifier rather than a feature

df_shopping = df_shopping.drop(columns=["CustomerID"])
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,Yes,19.0,15000,39.0
1,Yes,21.0,15000,81.0
2,No,20.0,16000,6.0
3,No,23.0,16000,77.0
4,No,31.0,17000,40.0


## 18.2.5 Data Processing

### Is the data in a format that can be passed into an unsupervised learning model?
***To make sure we can use our string data, we'll transform our strings of Yes and No from the Card Member column to 1 and 0, respectively, by creating a function that will convert Yes to a 1 and anything else to 0.***

The function will then be run on the whole column with the .apply method, as shown below:

In [26]:
# Transform String Column
def change_string(member):
    if member == "Yes":
        return 1
    else:
        return 0

In [29]:
df_shopping["Card Member"] = df_shopping["Card Member"].apply(change_string)
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
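
The same conversion can also be written without a helper function. The sketch below shows two vectorized alternatives, assuming a stand-in Series named `card_member` (hypothetical) that holds the five values shown earlier:

```python
import pandas as pd

# Stand-in for the Card Member column
card_member = pd.Series(["Yes", "Yes", "No", "No", "No"], name="Card Member")

# Option 1: boolean comparison; like change_string, anything but "Yes" becomes 0
encoded = (card_member == "Yes").astype(int)

# Option 2: explicit mapping; values missing from the dict become NaN,
# which makes unexpected categories visible instead of silently turning into 0
mapped = card_member.map({"Yes": 1, "No": 0})

print(encoded.tolist())
print(mapped.tolist())
```

Which option fits best depends on whether unexpected values should be treated as 0 or surfaced for review.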


In [42]:
# Reformat the column names so they contain no spaces or numbers

df_shopping.columns = df_shopping.columns.str.replace(' ', '')

df_shopping = df_shopping.rename(columns = {'SpendingScore(1-100)':'SpendingScore'})

df_shopping.head()

Unnamed: 0,CardMember,Age,AnnualIncome,SpendingScore
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
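
When many columns need renaming, listing each one in rename() gets tedious. A more general approach can be sketched with a small helper; `clean_column_name` is a hypothetical function written for this example, and it assumes that any parenthetical in a column name can be safely discarded:

```python
import re

import pandas as pd

def clean_column_name(name: str) -> str:
    """Remove any parenthetical such as "(1-100)", then all whitespace."""
    name = re.sub(r"\(.*?\)", "", name)
    return re.sub(r"\s+", "", name)

# Column names as they appear before the renaming step above
columns = pd.Index(["Card Member", "Age", "Annual Income", "Spending Score (1-100)"])
cleaned = [clean_column_name(c) for c in columns]
print(cleaned)
```

The resulting list could then be assigned to df_shopping.columns in place of the two-step rename.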


## 18.2.6 Data Transformation

### Can I quickly hand off this data for others to use?

The data now needs to be saved in a more user-friendly format. It would be nice if everyone was as comfortable with DataFrames as you are; unfortunately, that is not the case. You'll want to export the final product to a common file format such as CSV or Excel.

In [43]:
# Saving cleaned data
file_path = "../Resources/shopping_data_cleaned.csv"
df_shopping.to_csv(file_path, index=False)
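
Before handing the file off, it can be worth confirming that it reads back the way you expect. The sketch below is a round-trip check under stated assumptions: `df_clean` is a hypothetical stand-in for the cleaned DataFrame, and an in-memory buffer replaces the real file path so the example does not touch the disk:

```python
import io

import pandas as pd

# Stand-in for the cleaned df_shopping (first three rows shown above)
df_clean = pd.DataFrame({
    "CardMember": [1, 1, 0],
    "Age": [19.0, 21.0, 20.0],
    "AnnualIncome": [15000, 15000, 16000],
    "SpendingScore": [39.0, 81.0, 6.0],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
df_clean.to_csv(buffer, index=False)
buffer.seek(0)
df_reloaded = pd.read_csv(buffer)

print(df_reloaded.columns.tolist())
print(df_reloaded.shape)
```

Passing index=False matters here: if the index is written to the file, reading it back produces an extra "Unnamed: 0" column.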

Now you know the questions to ask about your data and understand the Pandas processes used to help answer those questions. Different datasets have different issues. With practice, you'll get better at identifying these.