The Marketing department of Adventure Works Cycles wants to increase sales by targeting specific
customers for a mailing campaign. The company's database contains a list of past customers and a list of
potential new customers. By investigating the attributes of previous bike buyers, the company hopes to
discover patterns that they can then apply to potential customers. They hope to use the discovered patterns
to predict which potential customers are most likely to purchase a bike from Adventure Works Cycles.

# Part I: Feature Selection, Cleaning, and Preprocessing to Construct an Input from the Data Source

In [2]:
import numpy as np
import pandas as pd
data=pd.read_csv('AWCustomers.csv')
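# Note: `data.head` without parentheses prints the bound method's repr, which
# dumps the whole frame; `data.head()` would print only the first five rows.
print(data.head)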

<bound method NDFrame.head of        CustomerID Title FirstName MiddleName  LastName Suffix  \
0           21173   NaN      Chad          C      Yuan    NaN   
1           13249   NaN      Ryan        NaN     Perry    NaN   
2           29350   NaN     Julia        NaN  Thompson    NaN   
3           13503   NaN  Theodore        NaN     Gomez    NaN   
4           22803   NaN  Marshall          J      Shan    NaN   
...           ...   ...       ...        ...       ...    ...   
18356       25414   NaN     Grace          C    Bailey    NaN   
18357       11459   NaN     Tasha        NaN      Deng    NaN   
18358       12160   NaN    Jaclyn        NaN     Zhang    NaN   
18359       14353   NaN      Erin          I      Reed    NaN   
18360       16676   NaN    Amanda        NaN     Perry    NaN   

                 AddressLine1 AddressLine2            City  \
0          7090 C. Mount Hood          NaN      Wollongong   
1         3651 Willow Lake Rd          NaN         Shawnee   
2  

* (a) Examine the values of each attribute and select only the attributes that are likely to help predict
future bike buyers, to create the input for your data mining algorithms. Remove all unnecessary
attributes. (Select features by analysis alone.)
* (b) Create a new DataFrame with the selected attributes only.
* (c) Determine the data value type (Discrete or Continuous, then Nominal, Ordinal, Interval, or Ratio) of
each attribute in your selection to identify the preprocessing tasks needed to create the input for data mining.

In [None]:
# Attributes like CustomerID, Title, FirstName, MiddleName, LastName, and Suffix are unlikely to directly influence purchasing behavior.
# AddressLine1, AddressLine2, City, StateProvinceName, and PostalCode might have some regional influence but are less direct than demographic/financial factors.
# LastUpdated is a timestamp and not a predictive feature for future behavior.

In [7]:
#(a) Select attributes based on analysis that would likely affect bike purchasing
selected_attributes = [
    'CountryRegionName',
    'Education',
    'Occupation',
    'Gender',
    'MaritalStatus',
    'HomeOwnerFlag',
    'NumberCarsOwned',
    'NumberChildrenAtHome',
    'TotalChildren',
    'YearlyIncome',

]

# (b) Create a new Data Frame with the selected attributes
try:
    data_selected = data[selected_attributes].copy()
except KeyError as e:
    print(f"Error: One or more selected attributes not found in the original DataFrame. Please check the attribute names. Missing attribute: {e}")
    data_selected = pd.DataFrame() # Create an empty DataFrame if there's an error


# (c) Determine data value types of each attribute in the selection
if not data_selected.empty:
    data_types = data_selected.dtypes
    print("\nData Types of Selected Attributes:")
    for col, dtype in data_types.items():
        print(f"- {col}: {dtype}")
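# Note: `.head` without parentheses prints the bound method's repr (the whole
# frame); `data_selected.head()` would print only the first five rows.
print(data_selected.head)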



Data Types of Selected Attributes:
- CountryRegionName: object
- Education: object
- Occupation: object
- Gender: object
- MaritalStatus: object
- HomeOwnerFlag: int64
- NumberCarsOwned: int64
- NumberChildrenAtHome: int64
- TotalChildren: int64
- YearlyIncome: int64
<bound method NDFrame.head of       CountryRegionName        Education      Occupation Gender MaritalStatus  \
0             Australia        Bachelors        Clerical      M             M   
1                Canada  Partial College        Clerical      M             M   
2         United States        Bachelors        Clerical      F             S   
3        United Kingdom  Partial College  Skilled Manual      M             M   
4               Germany  Partial College  Skilled Manual      M             S   
...                 ...              ...             ...    ...           ...   
18356     United States  Graduate Degree  Skilled Manual      F             M   
18357         Australia        Bachelors  Skilled Man
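
The dtypes printed above only show how pandas stores each column; task (c) asks for the measurement scale of each attribute. The sketch below records one possible classification. The assignments are judgment calls made for illustration (for example, treating `Education` as ordinal, and `YearlyIncome` as continuous even though it is stored as `int64`); they are not declared anywhere in the dataset. The helper `suggested_preprocessing` is a hypothetical name introduced here.

```python
# One possible classification of the selected attributes. This is an
# assumption based on column names and sample values, not read from the data.
attribute_scales = {
    "CountryRegionName":    ("Discrete", "Nominal"),
    "Education":            ("Discrete", "Ordinal"),
    "Occupation":           ("Discrete", "Nominal"),
    "Gender":               ("Discrete", "Nominal"),
    "MaritalStatus":        ("Discrete", "Nominal"),
    "HomeOwnerFlag":        ("Discrete", "Nominal"),   # binary flag
    "NumberCarsOwned":      ("Discrete", "Ratio"),
    "NumberChildrenAtHome": ("Discrete", "Ratio"),
    "TotalChildren":        ("Discrete", "Ratio"),
    "YearlyIncome":         ("Continuous", "Ratio"),
}


def suggested_preprocessing(kind, scale):
    """Map a (kind, scale) pair to a typical preprocessing step."""
    if scale == "Nominal":
        return "one-hot encode"
    if scale == "Ordinal":
        return "ordinal-encode or one-hot encode"
    if kind == "Continuous":
        return "standardize or normalize; optionally bin"
    return "standardize or normalize"


for name, (kind, scale) in attribute_scales.items():
    print(f"- {name}: {kind}, {scale} -> {suggested_preprocessing(kind, scale)}")
```

Keeping the classification in a plain dictionary makes it easy to drive later steps, such as choosing which columns to scale and which to encode, from a single place.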

# **Part II: Data Preprocessing and Transformation**
#### Depending on the data type of each attribute, transform each object from your preprocessed data.
#### Use all the data rows (approximately 18,000) with the selected features as input for every task below. Do not perform the tasks on the smaller data set obtained from random sampling.
* (a) Handling Null values
* (b) Normalization
* (c) Discretization (Binning) on Continuous attributes or Categorical Attributes with too many different values
* (d) Standardization/Normalization
* (e) Binarization (One Hot Encoding)


In [14]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Use all selected features as input
features = data_selected.copy()

# (a) Handling Null values
# Check for nulls first to decide on a strategy
print("Null values ")
print(features.isnull().sum())

# None of the selected columns contains nulls (see the counts in the output),
# so no imputation is applied here. If nulls were present, numerical columns
# could be filled with the mean or median and categorical columns with the
# most frequent value, for example via SimpleImputer.
numerical_features = features.select_dtypes(include=np.number).columns
categorical_features = features.select_dtypes(include='object').columns



# (b) & (d) Normalization/Standardization
# Applying z-score standardization (zero mean, unit variance) to all numerical features;
# min-max normalization is not applied in this cell
scaler = StandardScaler()
features[numerical_features] = scaler.fit_transform(features[numerical_features])

print("\nFeatures after Standardization:")
print(features[numerical_features].head())


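# (c) Discretization (Binning)
# Applying equal-width discretization to a continuous attribute.
# Note: YearlyIncome was standardized above, so the bins are computed on z-scores.
# Equal-width ('uniform') bins are unaffected by that linear rescaling, apart from
# floating-point effects at bin edges, so the bin indices match those of the raw incomes.

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # Example with 5 bins
# fit_transform returns an (n, 1) array; ravel() flattens it to one value per row
features['YearlyIncome_Binned'] = discretizer.fit_transform(features[['YearlyIncome']]).ravel()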
print("\nFeatures after Discretization:")
print(features[['YearlyIncome', 'YearlyIncome_Binned']].head())

# (e) Binarization (One Hot Encoding)
# Applying One Hot Encoding to categorical features
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # Use sparse_output=False for a dense array

# Fit and transform the categorical features
encoded_categorical_features = encoder.fit_transform(features[categorical_features])

# Create a DataFrame from the encoded features
encoded_categorical_df = pd.DataFrame(encoded_categorical_features, columns=encoder.get_feature_names_out(categorical_features))

# Drop the original categorical columns and concatenate the encoded ones
features = features.drop(categorical_features, axis=1)
features = pd.concat([features, encoded_categorical_df], axis=1)

print("\nFeatures after One Hot Encoding:")
print(features.head())

print("\nPreprocessing and Transformation Complete.")
print("Shape of the preprocessed features DataFrame:", features.shape)

Null values 
CountryRegionName       0
Education               0
Occupation              0
Gender                  0
MaritalStatus           0
HomeOwnerFlag           0
NumberCarsOwned         0
NumberChildrenAtHome    0
TotalChildren           0
YearlyIncome            0
dtype: int64

Features after Standardization:
   HomeOwnerFlag  NumberCarsOwned  NumberChildrenAtHome  TotalChildren  \
0       0.798603         1.892524             -0.594371       0.161342   
1       0.798603         0.798389              1.163279       1.239753   
2      -1.252187         1.892524             -0.594371      -0.917069   
3       0.798603         0.798389              1.163279       1.239753   
4       0.798603        -0.295746             -0.594371      -0.917069   

   YearlyIncome  
0      0.298555  
1      0.271180  
2      0.444261  
3     -0.367401  
4     -0.682765  

Features after Discretization:
   YearlyIncome  YearlyIncome_Binned
0      0.298555                  2.0
1      0.271180       
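
The discretization step above bins a column that has already been standardized. The sketch below, a hand-rolled equal-width binning rather than scikit-learn's implementation, illustrates on made-up incomes why that ordering does not change the bin indices: standardization is a linear rescaling, and equal-width bins depend only on each value's relative position between the minimum and the maximum.

```python
from statistics import mean, pstdev


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum sits on the last edge, so clip it into the final bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


incomes = [10_000, 25_000, 40_000, 72_000, 110_000]  # toy values for illustration

# z-score standardization using the population standard deviation
mu, sigma = mean(incomes), pstdev(incomes)
standardized = [(v - mu) / sigma for v in incomes]

raw_bins = equal_width_bins(incomes, 5)
std_bins = equal_width_bins(standardized, 5)

print("bins on raw incomes:         ", raw_bins)
print("bins on standardized incomes:", std_bins)
```

The same invariance would not hold for a quantile-based strategy applied after a non-linear transform, so binning the raw values first is the more transparent ordering when the bin edges need to be interpreted in currency units.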

# Part III: Calculating Proximity / Correlation Analysis of Two Features
#### Make sure all numeric attributes are transformed to the same scale, each nominal attribute is binarized, and each discretized numeric attribute is standardized. Apply the appropriate similarity measure to nominal (one-hot encoded)/binary attributes and to numeric attributes respectively.
* (a) Calculate the Simple Matching, Jaccard, and Cosine similarity between two objects of your transformed input data.
* (b) Calculate the correlation between the two features Commute Distance and Yearly Income.
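
As a worked illustration of the three measures in (a), the sketch below one-hot encodes the categorical values of rows 0 and 1 as they appear in the Part I output and computes each similarity by hand. The category vocabularies are assumptions, since only some of these values are visible in the excerpted output, and the name `one_hot` is introduced here for illustration. For binary vectors, with f11 the count of positions where both are 1 and f00 where both are 0, Simple Matching is (f11 + f00) / n and Jaccard is f11 / (n - f00).

```python
import math

# Assumed category vocabularies (illustrative; not read from the data).
vocab = {
    "CountryRegionName": ["Australia", "Canada", "France", "Germany",
                          "United Kingdom", "United States"],
    "Education": ["Bachelors", "Graduate Degree", "High School",
                  "Partial College", "Partial High School"],
    "Occupation": ["Clerical", "Management", "Manual", "Professional",
                   "Skilled Manual"],
    "Gender": ["F", "M"],
    "MaritalStatus": ["M", "S"],
}


def one_hot(record):
    """Encode a record as a flat 0/1 list, one position per (attribute, value)."""
    return [1 if record[attr] == value else 0
            for attr, values in vocab.items() for value in values]


# Categorical values of rows 0 and 1 from the Part I output.
a = one_hot({"CountryRegionName": "Australia", "Education": "Bachelors",
             "Occupation": "Clerical", "Gender": "M", "MaritalStatus": "M"})
b = one_hot({"CountryRegionName": "Canada", "Education": "Partial College",
             "Occupation": "Clerical", "Gender": "M", "MaritalStatus": "M"})

n = len(a)
f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both present
f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both absent

smc = (f11 + f00) / n          # counts 0-0 matches as agreement
jaccard = f11 / (n - f00)      # ignores 0-0 matches
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(f"SMC = {smc}, Jaccard = {jaccard:.4f}, Cosine (binary columns only) = {cosine:.4f}")
```

Under these assumed vocabularies there are 20 binary columns, three shared 1s, and four mismatches, giving SMC = 16/20 and Jaccard = 3/7. Because one-hot vectors are mostly zeros, SMC is inflated by 0-0 matches, which is why Jaccard is usually preferred for such asymmetric binary data. The cosine value here covers only the binary columns, so it is not comparable to a cosine computed over all transformed features.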

In [19]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import jaccard

# Compare the first two rows of the transformed data
object1 = features.iloc[0].values.reshape(1, -1)
object2 = features.iloc[1].values.reshape(1, -1)

# (a) Calculate Similarity measures

# Simple Matching and Jaccard are defined for binary attributes,
# so restrict them to the one-hot encoded columns
one_hot_encoded_cols = [col for col in features.columns if any(cat_col in col for cat_col in categorical_features)]

object1_binary = features.iloc[0][one_hot_encoded_cols].values.reshape(1, -1)
object2_binary = features.iloc[1][one_hot_encoded_cols].values.reshape(1, -1)

# Simple Matching: fraction of positions where both objects agree (1-1 and 0-0)
smc_similarity = np.sum(object1_binary == object2_binary) / object1_binary.shape[1]
print(f"Simple Matching Similarity between object 1 and object 2 (on binary features): {smc_similarity}")

# scipy's jaccard() returns a dissimilarity, so similarity = 1 - distance
jaccard_distance = jaccard(object1_binary[0], object2_binary[0])
jaccard_similarity = 1 - jaccard_distance
print(f"Jaccard Similarity between object 1 and object 2 (on binary features): {jaccard_similarity}")

# Cosine similarity over all transformed columns
# (standardized numerics, binned income, and one-hot columns)
cosine_sim = cosine_similarity(object1, object2)[0][0]
print(f"Cosine Similarity between object 1 and object 2 (on all features): {cosine_sim}")


# (b) Calculate Correlation between two features Commute Distance and Yearly Income
# The check runs against the original `data` frame, not only the selected features

if 'CommuteDistance' in data.columns and 'YearlyIncome' in data.columns:
    correlation = data['CommuteDistance'].corr(data['YearlyIncome'])
    print(f"\nCorrelation between Commute Distance and Yearly Income: {correlation}")
else:
    print("\nCould not calculate correlation between Commute Distance and Yearly Income because the loaded data has no CommuteDistance column.")

Simple Matching Similarity between object 1 and object 2 (on binary features): 0.8
Jaccard Similarity between object 1 and object 2 (on binary features): 0.4285714285714286
Cosine Similarity between object 1 and object 2 (on all features): 0.6491317349422756

Could not calculate correlation between Commute Distance and Yearly Income because the loaded data has no CommuteDistance column.
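
If a Commute Distance column were available and stored as text categories (an assumption), it would need an ordinal encoding before a correlation coefficient could be computed. The sketch below shows that idea with a hand-written Pearson correlation. The category labels, their ordering, and every sample value are hypothetical and are not taken from `AWCustomers.csv`.

```python
import math

# Hypothetical ordered categories mapped to ranks (illustrative only).
commute_order = {
    "0-1 Miles": 0,
    "1-2 Miles": 1,
    "2-5 Miles": 2,
    "5-10 Miles": 3,
    "10+ Miles": 4,
}

# Made-up sample rows for illustration only.
commute = ["0-1 Miles", "2-5 Miles", "1-2 Miles", "10+ Miles", "5-10 Miles"]
income = [30_000, 60_000, 40_000, 90_000, 70_000]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


codes = [commute_order[c] for c in commute]  # text category -> ordinal rank
r = pearson(codes, income)
print(f"Pearson correlation on the toy data: {r:.4f}")
```

Because an ordinal encoding only preserves order, not spacing, a rank-based coefficient such as Spearman's may be a better fit than Pearson for this kind of attribute.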
