## Shop Customer Data

#### Research questions
1. What is the profession of most of our customers?
2. What is the age group of most of our customers?
3. What is the gender of most of our customers?
4. What is the average income of our customers?
5. Are our customers low-income or high-income earners?
6. What is their average family size?
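
As a rough illustration of how questions like these might be answered with pandas, the sketch below uses a small, hypothetical toy DataFrame (the rows are invented for illustration; only the column names follow the dataset). It is one possible approach, not necessarily the analysis carried out later.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the Customers.csv dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Female"],
    "Age": [21, 45, 33, 67],
    "Annual Income ($)": [35000, 86000, 59000, 140000],
    "Profession": ["Engineer", "Artist", "Engineer", "Lawyer"],
    "Family Size": [3, 1, 2, 6],
})

# Most frequent value of a categorical column: count each value, take the top label.
most_common_profession = toy["Profession"].value_counts().idxmax()
most_common_gender = toy["Gender"].value_counts().idxmax()

# Averages of numeric columns.
average_income = toy["Annual Income ($)"].mean()
average_family_size = toy["Family Size"].mean()

print(most_common_profession, most_common_gender, average_income, average_family_size)
```

The age-group and income-group questions can be answered the same way once category columns exist, by applying `value_counts()` to those columns.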


Note
- Teenager: ages 19 and below
- Early Adult: ages 20 to 39
- Middle Adult: ages 40 to 59
- Old Age: ages 60 and above
- Low Earner: annual income of 70000 or below
- Average Earner: annual income above 70000 and below 120045
- High Earner: annual income of 120045 or above
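
The category definitions in this note can be sketched as two small plain-Python functions. The function names are hypothetical, and the handling of values that fall exactly on a boundary (19, 70000 and 120045), as well as the "Teenager" label for ages of 19 and below, are assumptions made for this sketch.

```python
def age_category(age):
    """Map an age in years to an age category (boundary handling is an assumption)."""
    if age <= 19:
        return "Teenager"
    if age <= 39:
        return "Early Adult"
    if age <= 59:
        return "Middle Adult"
    return "Old Age"


def income_category(income):
    """Map an annual income to an income category (boundary handling is an assumption)."""
    if income <= 70000:
        return "Low Earner"
    if income < 120045:
        return "Average Earner"
    return "High Earner"


# Checking the values on either side of each boundary.
for age in (19, 20, 39, 40, 59, 60):
    print(age, age_category(age))
for income in (70000, 70001, 120044, 120045):
    print(income, income_category(income))
```

Checking the boundary values explicitly is a cheap way to confirm that no age or income falls through to the wrong category.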

In [1]:
# Importing relevant libraries
import pandas as pd

In [2]:
# loading Dataset
df_customer_data = pd.read_csv('Customers.csv')

In [3]:
# making a copy of the customers dataset
df = df_customer_data.copy()

In [4]:
df.head()

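|   | CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 |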


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              2000 non-null   int64 
 1   Gender                  2000 non-null   object
 2   Age                     2000 non-null   int64 
 3   Annual Income ($)       2000 non-null   int64 
 4   Spending Score (1-100)  2000 non-null   int64 
 5   Profession              1965 non-null   object
 6   Work Experience         2000 non-null   int64 
 7   Family Size             2000 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 125.1+ KB


### Issues
1. The Profession column has missing values; these will be filled with 'Unemployed'.
2. 24 records have an age of 0; however, the dataset to be analysed will only include customers aged 20 and above.
3. 2 records have a spending score of 0, which is outside the 1-100 scale.
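
Issues like these can be counted with boolean masks. The sketch below uses a hypothetical toy DataFrame (the rows are invented, so the counts differ from those listed above) to show the pattern.

```python
import pandas as pd

# Hypothetical toy rows that reproduce the three kinds of issue on a small scale.
toy = pd.DataFrame({
    "Age": [0, 25, 41, 0],
    "Spending Score (1-100)": [39, 0, 77, 50],
    "Profession": ["Artist", None, "Doctor", None],
})

# Each comparison yields a boolean Series; summing it counts the True values.
missing_professions = toy["Profession"].isnull().sum()
zero_age_records = (toy["Age"] == 0).sum()
zero_score_records = (toy["Spending Score (1-100)"] == 0).sum()

print(missing_professions, zero_age_records, zero_score_records)
```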

In [6]:
# Checking for null values
df.isnull().sum()

CustomerID                 0
Gender                     0
Age                        0
Annual Income ($)          0
Spending Score (1-100)     0
Profession                35
Work Experience            0
Family Size                0
dtype: int64

In [7]:
# Assume customers with no recorded profession are unemployed
df['Profession'] = df['Profession'].fillna('Unemployed')

# Check whether any null values remain
df.isnull().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income ($)         0
Spending Score (1-100)    0
Profession                0
Work Experience           0
Family Size               0
dtype: int64

In [8]:
# Dropping records with a spending score of zero
df.drop(df[df['Spending Score (1-100)'] == 0].index, axis=0, inplace=True)
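
An alternative to dropping rows by index in place is to keep only the rows that pass a boolean mask. The sketch below shows this on a hypothetical toy DataFrame; it returns a new DataFrame and leaves the original untouched, which can make it easier to re-run cells.

```python
import pandas as pd

# Hypothetical toy rows; one record has a spending score of zero.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Spending Score (1-100)": [39, 0, 81],
})

# Keep only the rows whose spending score is non-zero.
filtered = toy[toy["Spending Score (1-100)"] != 0]

print(len(toy), len(filtered))
```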

In [10]:
# Describing the numerical fields of the dataset
df.describe()

|   | CustomerID | Age | Annual Income ($) | Spending Score (1-100) | Work Experience | Family Size |
|---|---|---|---|---|---|---|
| count | 1998.0 | 1998.0 | 1998.0 | 1998.0 | 1998.0 | 1998.0 |
| mean | 1001.274775 | 48.956456 | 110780.101602 | 51.013514 | 4.103604 | 3.765766 |
| std | 577.26326 | 28.420295 | 45736.878662 | 27.902027 | 3.922864 | 1.969265 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 25% | 502.25 | 25.0 | 74727.0 | 28.0 | 1.0 | 2.0 |
| 50% | 1001.5 | 48.0 | 110252.0 | 50.0 | 3.0 | 4.0 |
| 75% | 1500.75 | 73.0 | 149094.25 | 75.0 | 7.0 | 5.0 |
| max | 2000.0 | 99.0 | 189974.0 | 100.0 | 17.0 | 9.0 |


In [11]:
# Creating the age category column using teenager, early adult, middle adult and old age as categories
Age_Category = []

for i in df['Age']:
    if i <= 19:
        Age_Category.append('Teenager')
    elif i <= 39:
        # Upper bound is inclusive so that age 39 is not labelled 'Old Age'
        Age_Category.append('Early Adult')
    elif i <= 59:
        # Upper bound is inclusive so that age 59 is not labelled 'Old Age'
        Age_Category.append('Middle Adult')
    else:
        Age_Category.append('Old Age')

df['Age Category'] = Age_Category


# Creating the income category column using low earner, average earner and high earner as categories
income_Category = []

for i in df['Annual Income ($)']:
    if i <= 70000:
        income_Category.append('Low Earner')
    elif i < 120045:
        income_Category.append('Average Earner')
    else:
        income_Category.append('High Earner')

df['Income Category'] = income_Category

In [12]:
df.head()

|   | CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size | Age Category | Income Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15000 | 39 | Healthcare | 1 | 4 | Teenager | Low Earner |
| 1 | 2 | Male | 21 | 35000 | 81 | Engineer | 3 | 3 | Early Adult | Low Earner |
| 2 | 3 | Female | 20 | 86000 | 6 | Engineer | 1 | 1 | Early Adult | Average Earner |
| 3 | 4 | Female | 23 | 59000 | 77 | Lawyer | 0 | 2 | Early Adult | Low Earner |
| 4 | 5 | Female | 31 | 38000 | 40 | Entertainment | 2 | 6 | Early Adult | Low Earner |
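
The loop-based categorisation above could also be written with `pd.cut`, which bins a numeric column in one vectorised call. This is a sketch on hypothetical toy data, not the method used in this notebook. It assumes that upper bin edges are inclusive (the `pd.cut` default) and that incomes are whole numbers, so an edge of 120044 separates "below 120045" from "120045 and above".

```python
import pandas as pd

# Hypothetical toy rows chosen to sit on either side of each boundary.
toy = pd.DataFrame({
    "Age": [0, 19, 20, 39, 40, 59, 60, 99],
    "Annual Income ($)": [0, 70000, 70001, 120044, 120045, 189974, 15000, 86000],
})

# Bins are (-1, 19], (19, 39], (39, 59], (59, inf]; the right edge is inclusive by default.
toy["Age Category"] = pd.cut(
    toy["Age"],
    bins=[-1, 19, 39, 59, float("inf")],
    labels=["Teenager", "Early Adult", "Middle Adult", "Old Age"],
)

# Bins are (-1, 70000], (70000, 120044], (120044, inf]; assumes integer incomes.
toy["Income Category"] = pd.cut(
    toy["Annual Income ($)"],
    bins=[-1, 70000, 120044, float("inf")],
    labels=["Low Earner", "Average Earner", "High Earner"],
)

print(toy)
```

One difference from the loop version is that `pd.cut` produces ordered categorical columns rather than plain strings, which keeps the groups in a sensible order when counting or plotting them.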


In [13]:
# The minimum working age is assumed to be 20 years, so only customers aged 20 and above are kept
df_20 = df[df.Age > 19]
df_20.shape

(1619, 10)

In [14]:
# Saving the filtered dataset (customers aged 20 and above) to a new CSV file
df_20.to_csv('customer data.csv')
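
When a DataFrame is saved with `to_csv` using the default settings, the row index is written as an extra, unnamed column, and it reappears as `Unnamed: 0` when the file is read back. The sketch below demonstrates this on a hypothetical toy DataFrame, writing to in-memory buffers instead of files. Whether the index should be kept in the exported file is a design choice; passing `index=False` is shown only as an option.

```python
import io

import pandas as pd

# Hypothetical toy rows with a non-default index, as left behind by row filtering.
toy = pd.DataFrame({"CustomerID": [2, 3], "Age": [21, 20]}, index=[1, 2])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False the index is left out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```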