# Project 1: Data Preprocessing

## Overview

### About this assignment
This assignment gives you hands-on experience with essential data preprocessing techniques. Using the **California Housing Dataset**, you will practice calculating descriptive statistics, cleaning data, normalizing and discretizing attributes, visualizing distributions, and computing dissimilarity matrices.

### Objectives

- Practice calculating descriptive statistics, handling missing values, and detecting duplicates.
- Learn how to normalize and discretize attributes.
- Visualize data distributions using histograms, box plots, and scatter plots.
- Compute dissimilarity matrices for nominal, ordinal, and mixed-type attributes.

### Attributes in the dataset
The names below match the column names in the loaded data frame.
1. **longitude**: Longitude coordinate of the block where the house is located.
2. **latitude**: Latitude coordinate of the block where the house is located.
3. **housing_median_age**: Median age of houses within a block (years).
4. **total_rooms**: Total number of rooms within a block.
5. **total_bedrooms**: Total number of bedrooms within a block.
6. **population**: Total population of a block.
7. **households**: Total number of households within a block.
8. **median_income**: Median income for households in the block (in tens of thousands of US dollars).
9. **median_house_value**: Median house value for households in a block (in US dollars).
10. **ocean_proximity**: Proximity of the block to the ocean.






## Imports and Settings

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
%matplotlib inline
from matplotlib import rcParams  # import rcParams from matplotlib directly; the pylab interface is discouraged

In [None]:
# Set the default figure size for matplotlib plots to 15 inches wide by 6 inches tall
rcParams["figure.figsize"] = (15, 6)

# Increase the default font size of the titles in matplotlib plots to extra-extra-large
rcParams["axes.titlesize"] = "xx-large"

# Make the titles of axes in matplotlib plots bold for better visibility
rcParams["axes.titleweight"] = "bold"

# Set the default location of the legend in matplotlib plots to the upper left corner
rcParams["legend.loc"] = "upper left"

# Configure pandas to display all columns of a DataFrame when printed to the console
pd.set_option('display.max_columns', None)

# Configure pandas to display all rows of a DataFrame when printed to the console
pd.set_option('display.max_rows', None)

### Data Loading
Load the California housing dataset into a pandas DataFrame named `df`.

In [None]:
url = "https://gvsu-cis635.github.io/_downloads/cbdbd448a2884edab50a2bc50eb89749/housing.csv"
df = pd.read_csv(url)
print('Number of instances = %d' % (df.shape[0]))
print('Number of attributes = %d' % (df.shape[1]))
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in display()
display(df.head(n=10))

Number of instances = 20640
Number of attributes = 10
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.12 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |


### Question 1: Descriptive Statistics
Calculate and display the mean, median, maximum, minimum, standard deviation, and Interquartile Range (IQR) for all the **Numeric** attributes in the dataframe `df`.
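
As a reminder of the mechanics, here is a minimal sketch on a small toy frame. The frame `toy` and its columns are hypothetical and are not part of the housing data; adapt the idea to `df` yourself. It assumes the IQR is defined as the 75th percentile minus the 25th percentile, and it relies on pandas' default sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy data, not the housing dataset
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, 4.0, 6.0, 9.0],
    "income": [1.0, 2.0, 3.0, 4.0, 10.0],
    "city": ["a", "b", "a", "c", "b"],  # non-numeric, excluded below
})

# Keep only the numeric columns
numeric = toy.select_dtypes(include="number")

# One row per attribute, one column per statistic
summary = numeric.agg(["mean", "median", "max", "min", "std"]).T

# IQR = Q3 - Q1, computed per column
summary["IQR"] = numeric.quantile(0.75) - numeric.quantile(0.25)

print(summary)
```

Transposing the result of `agg` puts each attribute on its own row, which tends to be easier to read when there are many attributes and few statistics.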

In [1]:
# your answer

### Question 2: Data Cleaning

1. **Missing Values**: First, check if there are any samples with missing data. If found, display the affected samples and fill the missing values with the attribute's average. If no missing values exist, no further action is needed.

2. **Duplicate Detection**: Check for any duplicate records. If duplicates are found, display and remove all but one instance of each duplicate; otherwise, no further steps are required.
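
The two steps above can be sketched on a toy frame as follows. The frame `toy` and its values are hypothetical. The sketch assumes that only numeric columns contain missing values and that "duplicate" means a row identical to another row in every column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one repeated row
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 5.0, 7.0],
    "region": ["x", "y", "z", "z", "x"],
})

# Rows that contain at least one missing value
missing_rows = toy[toy.isna().any(axis=1)]
print(missing_rows)

# Fill missing numeric entries with the column mean (mean() skips NaN)
num_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())

# keep=False flags every member of a duplicate group, which is useful for display
dupes = filled[filled.duplicated(keep=False)]
print(dupes)

# drop_duplicates() keeps the first occurrence of each group by default
deduped = filled.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Whether to look for duplicates before or after imputation is a judgment call; this sketch checks afterwards, so rows that become identical once the gap is filled would also be flagged.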

In [2]:
# your answer

### Question 3: Data Normalization
Normalize all numerical attributes using **Z-score normalization** and save the result as `df_normalized`.
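
For reference, z-score normalization maps each value $x$ to $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the attribute's mean and standard deviation. A minimal sketch on hypothetical toy data is shown below. It uses the population standard deviation (`ddof=0`), which is what scikit-learn's `StandardScaler` uses; pandas' `std()` defaults to the sample version (`ddof=1`), so the two give slightly different results unless `ddof` is set explicitly.

```python
import pandas as pd

# Hypothetical toy data, not the housing dataset
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, 6.0, 8.0],
    "value": [100.0, 200.0, 300.0, 400.0],
    "region": ["x", "y", "x", "y"],  # non-numeric, left untouched
})

num_cols = toy.select_dtypes(include="number").columns

toy_normalized = toy.copy()
# z = (x - mean) / std, applied column by column
toy_normalized[num_cols] = (
    toy[num_cols] - toy[num_cols].mean()
) / toy[num_cols].std(ddof=0)

print(toy_normalized)
```

After the transformation each numeric column has mean 0 and (population) standard deviation 1, while the categorical column is carried over unchanged.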

In [3]:
# your answer

### Question 4: Data Discretization
Create two new features, `ew_median_house_value` and `ed_median_house_value`, in the `df` dataframe by transforming the `median_house_value` attribute into a discrete categorical feature with three categories (low, medium, high) using the following methods:
1. **Equal-width binning**: Divide the range of `median_house_value` into intervals of equal size and create `ew_median_house_value`.
2. **Equal-depth binning**: Distribute the `median_house_value` values into bins so that each bin contains roughly the same number of records, creating `ed_median_house_value`.
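
The difference between the two methods shows up clearly on skewed data. The sketch below uses a hypothetical six-value series and assumes `pd.cut` for equal-width bins and `pd.qcut` for equal-depth (quantile-based) bins; other approaches are possible.

```python
import pandas as pd

# Hypothetical skewed toy values, not the housing dataset
values = pd.Series([10, 20, 30, 40, 50, 100], name="value")
labels = ["low", "medium", "high"]

# Equal-width: three intervals of equal length spanning [min, max]
equal_width = pd.cut(values, bins=3, labels=labels)

# Equal-depth: bin edges are placed at the 1/3 and 2/3 quantiles
equal_depth = pd.qcut(values, q=3, labels=labels)

print(pd.DataFrame({"value": values, "ew": equal_width, "ed": equal_depth}))
```

Here the equal-width bins each cover a range of 30, so the outlier 100 pulls four of the six values into `low`, whereas the equal-depth bins hold two values each.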

In [4]:
# your answer

### Question 5: Data Visualization
Visualize the `median_house_value` and `housing_median_age` attributes in `df_normalized` using:
- Box plots to examine the distributions of `median_house_value` and `housing_median_age`.
- Histograms (using 50 bins) to observe the frequency distributions of `median_house_value` and `housing_median_age`.
- Scatter plot to explore the relationship between `median_house_value` and `housing_median_age`.
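
A possible layout for the three plot types, sketched on random toy data, is shown below. The frame `toy` and its columns `a` and `b` are hypothetical, and the sketch uses plain matplotlib; seaborn would work equally well.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data: two standard-normal columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Box plots: one box per attribute
axes[0].boxplot([toy["a"].to_numpy(), toy["b"].to_numpy()])
axes[0].set_xticklabels(["a", "b"])
axes[0].set_title("Box plots")

# Histograms with 50 bins each, overlaid with transparency
axes[1].hist(toy["a"], bins=50, alpha=0.6, label="a")
axes[1].hist(toy["b"], bins=50, alpha=0.6, label="b")
axes[1].legend()
axes[1].set_title("Histograms")

# Scatter plot of one attribute against the other
axes[2].scatter(toy["a"], toy["b"], s=10)
axes[2].set_xlabel("a")
axes[2].set_ylabel("b")
axes[2].set_title("Scatter plot")

fig.tight_layout()
```

In a notebook with `%matplotlib inline` the figure is rendered when the cell finishes; in a plain script, call `plt.show()` at the end.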

In [5]:
# your answer

### Question 6: Dissimilarity Matrix for Mixed-Type Attributes
Calculate the **dissimilarity matrix** for **the first 10 data samples** using the following attributes: **`housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, `households`, `median_income`, `ew_median_house_value` (ordinal), `ocean_proximity` (nominal)**.

The dissimilarity $d(i,j)$ between objects $i$ and $j$ is defined as:

$$d(i,j)=\frac{\sum^p_{f=1}\delta^f_{ij}d^f_{ij}}{\sum^p_{f=1}\delta^f_{ij}}$$

where $p$ is the number of attributes and the indicator $\delta^f_{ij}$ for attribute $f$ is $0$ in either of the following cases:
1. $x_i^f$ or $x_j^f$ is missing.
2. $x_i^f = x_j^f = 0$ and attribute $f$ is asymmetric binary.

Otherwise, $\delta^f_{ij}=1$.

For each attribute type:
- **Numeric**: $d_{ij}^{f} = \frac{|x_i^f-x_j^f|}{\max_f-\min_f}$, where $\max_f$ and $\min_f$ are the largest and smallest non-missing values of attribute $f$.
- **Nominal/Binary**: $d_{ij}^f = 0$ if $x_i^f = x_j^f$; otherwise, $d_{ij}^f = 1$.
- **Ordinal**: Suppose $f$ is an ordinal attribute with $M_f$ ordered states, ranked $1, \ldots, M_f$. Normalize the rank $r_i^f$ of object $i$ on attribute $f$ to $z_i^f = \frac{r_i^f-1}{M_f-1}$, then treat $z_i^f$ as a numeric value and compute the dissimilarity using the **Euclidean distance**, which for a single attribute reduces to $d_{ij}^f = |z_i^f - z_j^f|$.
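
To make the formula concrete, the sketch below applies it to three hypothetical objects with one numeric, one ordinal, and one nominal attribute. The frame `toy`, the helper functions, and the rank mapping `order` are illustrative names, not part of the assignment. The sketch assumes that $\max_f$ and $\min_f$ are taken over the toy rows themselves, that missing values occur only as `NaN` in numeric attributes, and that there are no asymmetric binary attributes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy objects, not the housing dataset
toy = pd.DataFrame({
    "age": [10.0, 20.0, 30.0],            # numeric
    "level": ["low", "high", "medium"],   # ordinal
    "kind": ["inland", "inland", "bay"],  # nominal
})
order = {"low": 1, "medium": 2, "high": 3}  # ranks 1..M_f


def numeric_dissim(x):
    """|x_i - x_j| / (max - min); NaN where either value is missing."""
    x = np.asarray(x, dtype=float)
    span = np.nanmax(x) - np.nanmin(x)
    return np.abs(x[:, None] - x[None, :]) / span


def nominal_dissim(x):
    """0 where the two values match, 1 otherwise."""
    x = np.asarray(x, dtype=object)
    return (x[:, None] != x[None, :]).astype(float)


def ordinal_dissim(x, order):
    """Map ranks to z = (r - 1) / (M - 1), then take |z_i - z_j|."""
    ranks = np.array([order[v] for v in x], dtype=float)
    z = (ranks - 1) / (len(order) - 1)
    return np.abs(z[:, None] - z[None, :])


# One n x n matrix per attribute, stacked along a new first axis
stacked = np.stack([
    numeric_dissim(toy["age"]),
    ordinal_dissim(toy["level"], order),
    nominal_dissim(toy["kind"]),
])

# delta = 1 where the per-attribute dissimilarity is defined, 0 where it is NaN
delta = ~np.isnan(stacked)
d = np.nansum(stacked, axis=0) / delta.sum(axis=0)

print(np.round(d, 4))
```

For objects 0 and 1, the per-attribute dissimilarities are $0.5$ (numeric), $1$ (ordinal), and $0$ (nominal), so $d(0,1) = (0.5 + 1 + 0)/3 = 0.5$.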

In [6]:
# your answer