# Assignment 2 
**Name: Niyati** 
**Roll No: 102303356** 

In [5]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import jaccard
import pandas as pd
import numpy as np

## Part I: Based on Feature Selection, Cleaning, and Preprocessing to Construct an Input from Data Source 

#### (a) Examine the values of each attribute and select only the attributes that would help predict future bike buyers, to create the input for data mining algorithms. Remove all unnecessary attributes. (Select features by analysis only.)

We keep the following attributes because they influence the likelihood of bike purchase:  
- Age (derived from BirthDate)  
- Education  
- Occupation  
- Gender  
- MaritalStatus  
- HomeOwnerFlag  
- NumberCarsOwned  
- NumberChildrenAtHome  
- TotalChildren  
- YearlyIncome  

We drop identifiers and contact details (CustomerID, names, addresses, phone, postal code, etc.) because they are unique or near-unique per customer and carry no predictive signal.

In [6]:
customers = pd.read_csv("AWCustomers.csv")
sales = pd.read_csv("AWSales.csv")

df = pd.merge(customers, sales, on="CustomerID")

df['BirthDate'] = pd.to_datetime(df['BirthDate'], errors='coerce')
df['Age'] = (pd.to_datetime("today") - df['BirthDate']).dt.days // 365

features = ["Age","Education","Occupation","Gender","MaritalStatus","HomeOwnerFlag",
            "NumberCarsOwned","NumberChildrenAtHome","TotalChildren","YearlyIncome"]

df = df[features + ["BikeBuyer"]]
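
The integer division by 365 above ignores leap days, so an age can be off by one close to a birthday. A minimal sketch of a calendar-based alternative follows; the birth dates and the fixed reference date `ref` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical birth dates, for illustration only.
birth = pd.to_datetime(pd.Series(["1988-03-15", "1972-11-02", "2000-02-29"]))
ref = pd.Timestamp("2025-03-14")  # assumed reference date, fixed for reproducibility

# True where the birthday has already occurred in the reference year.
had_birthday = (birth.dt.month < ref.month) | (
    (birth.dt.month == ref.month) & (birth.dt.day <= ref.day)
)

# Year difference, minus one if the birthday is still ahead.
age = ref.year - birth.dt.year - (~had_birthday).astype(int)
print(age.tolist())  # [36, 52, 25]
```

For this assignment the difference is at most one year per customer, so the simpler version is probably adequate.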

#### (b) Create a new Data Frame with the selected attributes only. 

The reduced DataFrame now contains only the chosen predictors along with the target variable BikeBuyer.

In [7]:
df.head()

| | Age | Education | Occupation | Gender | MaritalStatus | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | BikeBuyer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | Bachelors | Clerical | M | M | 1 | 3 | 0 | 1 | 81916 | 1 |
| 1 | 53 | Partial College | Clerical | M | M | 1 | 2 | 1 | 2 | 81076 | 1 |
| 2 | 39 | Bachelors | Clerical | F | S | 0 | 3 | 0 | 0 | 86387 | 1 |
| 3 | 47 | Partial College | Skilled Manual | M | M | 1 | 2 | 1 | 2 | 61481 | 1 |
| 4 | 50 | Partial College | Skilled Manual | M | S | 1 | 1 | 0 | 0 | 51804 | 1 |


#### (c) Determine the data value type (Discrete or Continuous, then Nominal, Ordinal, Interval, or Ratio) of each selected attribute to identify the preprocessing tasks needed to create input for data mining. 

- Age → Continuous, Ratio  
- Education → Discrete, Ordinal  
- Occupation → Discrete, Nominal  
- Gender → Discrete, Nominal (binary)  
- MaritalStatus → Discrete, Nominal (binary)  
- HomeOwnerFlag → Discrete, Nominal (binary)  
- NumberCarsOwned → Discrete, Ratio  
- NumberChildrenAtHome → Discrete, Ratio  
- TotalChildren → Discrete, Ratio  
- YearlyIncome → Continuous, Ratio  
- BikeBuyer (target) → Discrete, Nominal (binary)
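
Because Education is ordinal, its order can be made explicit with an ordered categorical. The sketch below assumes the levels rank from "Partial High School" up to "Graduate Degree"; that ranking is an assumption made here, not something the dataset states, and the sample values are made up.

```python
import pandas as pd

# Assumed ordering of education levels, lowest to highest.
education_order = ["Partial High School", "High School", "Partial College",
                   "Bachelors", "Graduate Degree"]

# Hypothetical sample values.
edu = pd.Series(["Bachelors", "Partial College", "Graduate Degree"])
edu_cat = pd.Categorical(edu, categories=education_order, ordered=True)

# Integer codes follow the declared order, so comparisons are meaningful.
codes = edu_cat.codes.tolist()
print(codes)  # [3, 2, 4]
```

Encoding the column this way keeps the ordering information that one-hot encoding discards.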

## Part II: Data Preprocessing and Transformation 

### Depending on the data type of each attribute, transform each object from your preprocessed data. 

#### Use all the data rows (~= 18000 rows) with the selected features as input for all the tasks below. Do not perform the tasks on the smaller data set obtained from random sampling.

#### (a) Handling Null values 

In [None]:
print(df.isnull().sum())

df = df.fillna({
    'YearlyIncome': df['YearlyIncome'].median(),
    'Education': df['Education'].mode()[0],
    'Occupation': df['Occupation'].mode()[0],
    'Age': df['Age'].median()
})

print(df.isnull().sum())


Age                     0
Education               0
Occupation              0
Gender                  0
MaritalStatus           0
HomeOwnerFlag           0
NumberCarsOwned         0
NumberChildrenAtHome    0
TotalChildren           0
YearlyIncome            0
BikeBuyer               0
dtype: int64
Age                     0
Education               0
Occupation              0
Gender                  0
MaritalStatus           0
HomeOwnerFlag           0
NumberCarsOwned         0
NumberChildrenAtHome    0
TotalChildren           0
YearlyIncome            0
BikeBuyer               0
dtype: int64
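
Both printouts show zero missing values, so the `fillna` call leaves this dataset unchanged. To illustrate what the same strategy does when values are missing (median for numeric columns, mode for categorical ones), here is a small sketch on a made-up frame; `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
toy = pd.DataFrame({
    "YearlyIncome": [50000.0, np.nan, 70000.0, 90000.0],
    "Education": ["Bachelors", "Bachelors", None, "High School"],
})

# Pick the fill value by dtype: median for numeric, mode otherwise.
fill_values = {}
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_values[col] = toy[col].median()
    else:
        fill_values[col] = toy[col].mode()[0]

filled = toy.fillna(fill_values)
print(filled)
```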


#### (b) Normalization  

In [9]:
# Rescale Age and YearlyIncome to the [0, 1] range (min-max normalization).
minmax = MinMaxScaler()
df[['Age','YearlyIncome']] = minmax.fit_transform(df[['Age','YearlyIncome']])
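
Min-max normalization maps each value to `(x - min) / (max - min)`. The sketch below checks this formula against `MinMaxScaler` on a few made-up income values (not taken from the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical incomes as a single-column 2D array.
income = np.array([[20000.0], [50000.0], [80000.0], [120000.0]])

# Manual min-max formula.
manual = (income - income.min()) / (income.max() - income.min())

# Same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(income)

print(manual.ravel())               # approximately 0, 0.3, 0.6, 1
print(np.allclose(manual, scaled))  # True
```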

#### (c) Discretization (Binning) on Continuous attributes or Categorical Attributes with too many different values  


In [10]:
df['IncomeBin'] = pd.qcut(df['YearlyIncome'], q=4, labels=["Low","Medium","High","Very High"])
df[['YearlyIncome','IncomeBin']].head()

| | YearlyIncome | IncomeBin |
|---|---|---|
| 0 | 0.496842 | High |
| 1 | 0.489453 | High |
| 2 | 0.536172 | High |
| 3 | 0.317083 | Medium |
| 4 | 0.231958 | Low |
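
`pd.qcut` produces equal-frequency bins (quartiles here), which is why the labels do not line up with evenly spaced cut points on the scaled income. As a rough comparison, the sketch below contrasts it with equal-width binning via `pd.cut` on made-up values that include one outlier.

```python
import pandas as pd

# Hypothetical values with one large outlier.
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 1000])
labels = ["Low", "Medium", "High", "Very High"]

# Equal-frequency bins: each bin receives about the same number of rows.
equal_freq = pd.qcut(income, q=4, labels=labels)

# Equal-width bins: the outlier stretches the range, crowding the first bin.
equal_width = pd.cut(income, bins=4, labels=labels)

print(equal_freq.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
```

With skewed income data, equal-frequency binning tends to give more balanced categories, which is a reasonable motivation for the `qcut` choice above.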


#### (d) Standardization/Normalization 

In [11]:
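# Z-score standardization (mean 0, standard deviation 1).
# This overwrites the min-max scaled values from (b); IncomeBin from (c) is unaffected.
scaler = StandardScaler()
df[['Age','YearlyIncome']] = scaler.fit_transform(df[['Age','YearlyIncome']])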
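
Standardizing columns that were already min-max scaled gives the same z-scores as standardizing the raw values, because z-scores are unchanged by a positive linear rescaling. A small sketch on made-up ages:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical ages as a single-column 2D array.
raw = np.array([[25.0], [40.0], [55.0], [70.0]])

# Z-scores computed directly on the raw values.
z_direct = StandardScaler().fit_transform(raw)

# Z-scores computed after min-max scaling first.
z_after_minmax = StandardScaler().fit_transform(MinMaxScaler().fit_transform(raw))

print(np.allclose(z_direct, z_after_minmax))  # True
```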

#### (e) Binarization (One Hot Encoding) 

In [12]:
df = pd.get_dummies(df, columns=['Education','Occupation','Gender','MaritalStatus','IncomeBin'],
                    drop_first=True)
df.head()

| | Age | HomeOwnerFlag | NumberCarsOwned | NumberChildrenAtHome | TotalChildren | YearlyIncome | BikeBuyer | Education_Graduate Degree | Education_High School | Education_Partial College | Education_Partial High School | Occupation_Management | Occupation_Manual | Occupation_Professional | Occupation_Skilled Manual | Gender_M | MaritalStatus_S | IncomeBin_Medium | IncomeBin_High | IncomeBin_Very High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.542546 | 1 | 3 | 0 | 1 | 0.298555 | 1 | False | False | False | False | False | False | False | False | True | False | False | True | False |
| 1 | 0.877383 | 1 | 2 | 1 | 2 | 0.27118 | 1 | False | False | True | False | False | False | False | False | True | False | False | True | False |
| 2 | -0.365055 | 0 | 3 | 0 | 0 | 0.444261 | 1 | False | False | False | False | False | False | False | False | False | True | False | True | False |
| 3 | 0.34491 | 1 | 2 | 1 | 2 | -0.367401 | 1 | False | False | True | False | False | False | False | True | True | False | True | False | False |
| 4 | 0.611147 | 1 | 1 | 0 | 0 | -0.682765 | 1 | False | False | True | False | False | False | False | True | True | True | False | False | False |
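
With `drop_first=True`, the first category of each attribute becomes the implicit baseline (all of its dummy columns are False). If 0/1 integers are preferred over booleans, `pd.get_dummies` accepts a `dtype` argument. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical two-column frame of binary nominal attributes.
toy = pd.DataFrame({"Gender": ["M", "F", "M"], "MaritalStatus": ["M", "S", "S"]})

# drop_first removes the alphabetically first level ("F" and "M" respectively).
encoded = pd.get_dummies(toy, columns=["Gender", "MaritalStatus"],
                         drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['Gender_M', 'MaritalStatus_S']
print(encoded.values.tolist())   # [[1, 0], [0, 1], [1, 1]]
```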


## Part III: Calculating Proximity /Correlation Analysis of two features 

### Make sure numeric attributes are transformed to the same scale, each nominal attribute is binarized, and each discretized numeric attribute is standardized. Apply the correct similarity measure to nominal (one-hot encoded)/binary attributes and to numeric attributes, respectively. 

#### (a) Calculate Simple Matching, Jaccard, and Cosine similarity between the following two objects of your transformed input data. 

In [13]:
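# Compare the first two customers, excluding the target column.
x = df.drop(columns=['BikeBuyer']).iloc[0]
y = df.drop(columns=['BikeBuyer']).iloc[1]

# Cast to float so the mixed bool/numeric row becomes a plain numeric vector.
x_arr, y_arr = x.values.astype(float), y.values.astype(float)

# Cosine similarity over all feature columns
cos_sim = cosine_similarity([x_arr],[y_arr])[0][0]

# Jaccard on a boolean mask of all columns
# (any non-zero value counts as True, numeric columns included)
bin_x, bin_y = x_arr.astype(bool), y_arr.astype(bool)
jaccard_sim = 1 - jaccard(bin_x, bin_y)

# Simple Matching: fraction of positions with exactly equal values
simple_matching = np.sum(x_arr == y_arr) / len(x_arr)

print("Cosine Similarity:", cos_sim)
print("Jaccard Similarity:", jaccard_sim)
print("Simple Matching:", simple_matching)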

Cosine Similarity: 0.7791177379990969
Jaccard Similarity: 0.7777777777777778
Simple Matching: 0.6842105263157895
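
The task statement asks for binary measures on the one-hot/binary attributes and a numeric measure on the scaled attributes. One possible way to separate the two is sketched below on made-up vectors; `x_bin`, `y_bin`, `x_num`, and `y_num` are hypothetical and do not correspond to rows of `df`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical objects, split into a binary part and a numeric part.
x_bin = np.array([1, 0, 0, 1, 1, 0])
y_bin = np.array([1, 1, 0, 1, 0, 0])
x_num = np.array([[-0.5, 0.3, 1.2]])
y_num = np.array([[0.9, 0.3, 0.8]])

# Contingency counts for the binary part.
f11 = np.sum((x_bin == 1) & (y_bin == 1))  # both 1
f00 = np.sum((x_bin == 0) & (y_bin == 0))  # both 0
f10 = np.sum((x_bin == 1) & (y_bin == 0))
f01 = np.sum((x_bin == 0) & (y_bin == 1))

# Simple Matching counts 0-0 agreements; Jaccard ignores them.
smc = (f11 + f00) / (f11 + f00 + f10 + f01)
jac = f11 / (f11 + f10 + f01)

# Cosine similarity on the numeric part only.
cos = cosine_similarity(x_num, y_num)[0, 0]

print("SMC:", smc, "Jaccard:", jac, "Cosine:", cos)
```

On these toy vectors SMC is 4/6 and Jaccard is 2/4, which shows how the two measures diverge once 0-0 matches are excluded.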


#### (b) Calculate Correlation between two features Commute Distance and Yearly Income

In [15]:
if 'CommuteDistance' in customers.columns:
    # Encode the ordinal commute ranges as integer ranks.
    mapping = {"0-1 Miles":1, "1-2 Miles":2, "2-5 Miles":3, "5-10 Miles":4, "10+ Miles":5}

    # Work on a copy so re-running the cell does not remap already-mapped values.
    corr_input = customers[['CustomerID','CommuteDistance','YearlyIncome']].copy()
    corr_input['CommuteDistance'] = corr_input['CommuteDistance'].map(mapping)

    # Keep only customers that also appear in the sales table (single merge).
    df_corr = pd.merge(corr_input, sales[['CustomerID']], on="CustomerID")

    print(df_corr[['CommuteDistance','YearlyIncome']].corr(method='pearson'))
else:
    print("CommuteDistance not available in dataset.")


CommuteDistance not available in dataset.
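
Since this copy of the dataset has no CommuteDistance column, no correlation could be computed. To illustrate the intended calculation, the sketch below runs it on a made-up frame using the same mapping as above; the values in `toy` are hypothetical and say nothing about the real data. Spearman is shown next to Pearson because the mapped commute ranks are ordinal.

```python
import pandas as pd

# Same ordinal mapping as in the cell above.
mapping = {"0-1 Miles": 1, "1-2 Miles": 2, "2-5 Miles": 3, "5-10 Miles": 4, "10+ Miles": 5}

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "CommuteDistance": ["0-1 Miles", "2-5 Miles", "10+ Miles", "1-2 Miles", "5-10 Miles"],
    "YearlyIncome": [30000, 60000, 110000, 40000, 90000],
})
toy["CommuteRank"] = toy["CommuteDistance"].map(mapping)

cols = ["CommuteRank", "YearlyIncome"]
pearson = toy[cols].corr(method="pearson").loc["CommuteRank", "YearlyIncome"]
spearman = toy[cols].corr(method="spearman").loc["CommuteRank", "YearlyIncome"]

print("Pearson:", pearson)
print("Spearman:", spearman)
```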
