# Objective

Explore the dataset to identify differences between the customers of each product. You can also explore relationships between the different attributes of the customers. You can approach it from any other line of questioning that you feel could be relevant for the business. The idea is to get you comfortable working in Python.

You are expected to do the following:

1. Come up with a customer profile (the characteristics of a typical customer) for each product.
2. Perform univariate and multivariate analyses.
3. Generate a set of insights and recommendations that will help the company target new customers.
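
As a rough illustration of the first task, a customer profile per product can be sketched with a `groupby` and a normalized cross-tabulation. The mini data set below is made up for illustration (the product labels `A` and `B` and all values are assumptions, not rows from the real file), and `profile_by_product` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical mini data set that mimics some of the real columns;
# the product labels and all values are invented for illustration only.
toy = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Age": [20, 30, 40, 50],
    "Income": [30000, 40000, 60000, 80000],
})


def profile_by_product(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of every numeric column per product (hypothetical helper)."""
    return df.groupby("Product").mean(numeric_only=True)


profile = profile_by_product(toy)
print(profile)

# Share of each gender within a product (each row sums to 1)
gender_share = pd.crosstab(toy["Product"], toy["Gender"], normalize="index")
print(gender_share)
```

With the real data, the same two calls could be applied to the loaded DataFrame to compare the products side by side.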

Data Dictionary:

The data is about customers of the treadmill product(s) of a retail store called Cardio Good Fitness.
It contains the following variables:

1. Product          - The model number of the treadmill
2. Age              - Age of the customer in years
3. Gender           - Gender of the customer
4. Education        - Education of the customer in years
5. Marital Status   - Marital status of the customer
6. Usage            - Average number of times per week the customer wants to use the treadmill
7. Fitness          - Self-rated fitness score of the customer (5 - very fit, 1 - very unfit)
8. Income           - Income of the customer
9. Miles            - Miles that the customer expects to run

**Types of Data**

1. Qualitative Data:
     * Nominal:
         * Gender
         * Marital Status
         * Product
     * Ordinal:
         * Fitness
2. Quantitative Data:
    * Discrete:
         * Age
         * Education
         * Usage
    * Continuous:
         * Income
         * Miles
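
The nominal/ordinal distinction can also be encoded in pandas, so that ordinal values such as `Fitness` support order-aware comparisons. The snippet below is a minimal sketch on a made-up three-row frame (the values are assumptions for illustration); it uses `CategoricalDtype` from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up rows for illustration only; not taken from the real data set.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Fitness": [3, 5, 1],
})

# Nominal: an unordered category
toy["Gender"] = toy["Gender"].astype("category")

# Ordinal: an ordered category, 1 (very unfit) < ... < 5 (very fit)
fitness_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
toy["Fitness"] = toy["Fitness"].astype(fitness_dtype)

print(toy.dtypes)
# Order-aware operations work on the ordinal column
print(toy["Fitness"].max())
print((toy["Fitness"] > 3).sum())
```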


# Importing libraries - pandas, numpy, seaborn, matplotlib.pyplot


In [2]:
import pandas as pd
from warnings import filterwarnings
filterwarnings(action='ignore')
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import pandas_profiling
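sns.set(color_codes=True)  # apply seaborn defaults and enable shorthand color codes
sns.set_style('whitegrid')  # white background with grid lines; must follow sns.set(), which resets the style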
pd.set_option('display.float_format', lambda float_num: '%.5f' % float_num) # Suppress numerical display in scientific notation

# Function Definitions

In [6]:
# EDA
def first_steps_eda(data_frame: pd.DataFrame) -> None:
  """
  Describe the data set: shape, features, missing values, dtypes,
  unique values, descriptive statistics and outliers.
  :param data_frame: pandas DataFrame to describe
  """

  df = data_frame


  # get the size of dataframe
  print('[*] Data has {} samples and {} features.'.format(df.shape[0], df.shape[1]))
  print("-" * 100)
  print(f"[*] Features : {df.columns.to_list()}\n\n")  # get name of columns/features

  print("-" * 100)
  print("[*] Missing values :\n\n", df.isnull().sum().sort_values(ascending=False))
  print("-" * 100)
  print("[*] Percent of missing :\n\n",
        round(df.isna().sum() / df.isna().count() * 100, 2).sort_values(ascending=False))
  print("-" * 100)
  print("[*] Dataset Info :")
  df.info()  # info() prints its report directly and returns None
  print("-" * 100)
  print("[*] Unique Values: ")
  print(df.nunique())
  print("-" * 100)
  '''
  print("\n[*] Checking for Unique Values: ")
  for feature_name in df.columns.tolist():  # Check for the unique values in the data
    print("Unique values in the column '{}' are \n\n".format(feature_name), df[feature_name].unique())
    print("-" * 100)
  '''
  print("\n[*] Descriptive Statistics from the Data")
  print(f"{df.describe().T}")
  print("-" * 100)
  print("\n[*] Outliers Identification ")
  for feature in df.select_dtypes(include=np.number).columns:  # Identifying Outliers
    identify_outliers_by_feature(df=df, feature=feature)  # use the frame passed in, not the global `data`

# Identify Outliers

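def identify_outliers_by_feature(df: pd.DataFrame, feature: str) -> None:
    """
    Identify outliers in a numeric feature using the 1.5 * IQR rule.
    :param df: pandas DataFrame holding the data
    :param feature: name of the numeric column to check
    :return None: results are printed
    """
    # Calculate the interquartile range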
    q25, q75 = np.percentile(df[feature], 25), np.percentile(df[feature], 75)
    iqr = q75 - q25
    print(f"Feature: {feature} \nIQR: {iqr}\nQ25: {q25}\nQ75: {q75}")
    # calculate the outlier cutoff
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    print(f"CutOff: {cut_off}\nLower: {lower}\nUpper: {upper}")
    # identify outliers
    outliers = [x for x in df[feature] if x < lower or x > upper]
    print("Outliers Identified:")
    print(pd.Series(outliers).sort_values(ascending=False))
    print("==" * 10)
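
To make the cutoff arithmetic in `identify_outliers_by_feature` concrete, here is a small worked sketch of the same 1.5 * IQR rule on a made-up list of five numbers. The helper name `iqr_bounds` and the sample values are assumptions for illustration; they are not part of the notebook.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) outlier cutoffs for the k * IQR rule."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = k * (q75 - q25)
    return q25 - cut_off, q75 + cut_off


# Made-up sample: Q25 = 2, Q75 = 4, so IQR = 2 and the cutoff is 3
sample = [1, 2, 3, 4, 100]
lower, upper = iqr_bounds(sample)
outliers = [x for x in sample if x < lower or x > upper]
print(lower, upper, outliers)  # -1.0 7.0 [100]
```

Any value below `lower` or above `upper` is flagged, which is why only 100 is reported here.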


# Exploratory Data Analysis

**Understanding the Data**

- Overview of the dataset shape and datatypes
- Statistical summary and check for missing values

In [5]:
data = pd.read_csv('CardioGoodFitness.csv')


In [7]:
first_steps_eda(data_frame=data)

[*] Data has 180 samples and 9 features.
----------------------------------------------------------------------------------------------------
[*] Features : ['Product', 'Age', 'Gender', 'Education', 'MaritalStatus', 'Usage', 'Fitness', 'Income', 'Miles']


----------------------------------------------------------------------------------------------------
[*] Missing values :

 Product          0
Age              0
Gender           0
Education        0
MaritalStatus    0
Usage            0
Fitness          0
Income           0
Miles            0
dtype: int64
----------------------------------------------------------------------------------------------------
[*] Percent of missing :

 Product         0.00000
Age             0.00000
Gender          0.00000
Education       0.00000
MaritalStatus   0.00000
Usage           0.00000
Fitness         0.00000
Income          0.00000
Miles           0.00000
dtype: float64
----------------------------------------------------------------------------

## Observations:

* 180 observations and 9 features
* Features: ['Product', 'Age', 'Gender', 'Education', 'MaritalStatus', 'Usage', 'Fitness', 'Income', 'Miles']
* No missing values
* Data types: int64(6), object(3)
* Unique values per feature:
    * Product: 3
    * Age: 32
    * Gender: 2
    * Education: 8
    * MaritalStatus: 2
    * Usage: 6
    * Fitness: 5
    * Income: 62
    * Miles: 37
* Customer age ranges from 18 to 50; the mean age in this dataset is 28
* Customer education ranges from 12 to 21 years; the mean is 15 years
* Usage ranges from 2 to 7 planned treadmill sessions per week; the mean is 3
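
Range and mean figures like the ones above can be pulled in a single call with `DataFrame.agg`. The sketch below uses a made-up frame (its values are assumptions for illustration, not the real data); with the notebook's data, the same call could presumably be applied to `data[['Age', 'Education', 'Usage']]`.

```python
import pandas as pd

# Made-up values for illustration only; not the real data set.
toy = pd.DataFrame({"Age": [18, 25, 50], "Usage": [2, 3, 7]})

# One row per feature, with min, max and mean as columns
summary = toy.agg(["min", "max", "mean"]).T
print(summary)
```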


