# Load the dataset

In [0]:
import pandas as pd
import numpy as np

In [0]:
# Table name
table_name = "housing"

# Load data from the table
df = spark.read.table(table_name)

In [0]:
# Convert pyspark dataframe to pandas dataframe
df = df.toPandas()

In [0]:
df.head()

# 1) Missing value imputation

In [0]:
df.isnull().sum()

In [0]:
df.dtypes

In [0]:
df.fillna(df.mean(numeric_only=True), inplace=True)

In [0]:
df.isnull().sum()

In [0]:
df.select_dtypes(include="object").columns

In [0]:
df['ocean_proximity'] = df['ocean_proximity'].fillna(df['ocean_proximity'].mode()[0])

# 2) Outlier removal

**Outlier removal** is the process of identifying data points that are unusually extreme compared to the rest of the dataset, and then removing or modifying them.

**There are several methods commonly used to remove outliers from a DataFrame. Here are a few of them:**

**1) Z-Score Method:**
- Calculate the z-score for each value in the DataFrame.
- Remove rows where any column has an absolute z-score greater than a predefined threshold (e.g., 3).
- This method assumes that the data follows a normal distribution.

**2) IQR (Interquartile Range) Method:**
- Calculate the IQR for each column in the DataFrame.
- Remove rows where any column value is below the first quartile minus a multiple of the IQR or above the third quartile plus a multiple of the IQR (e.g., 1.5 times the IQR).
- This method is robust to non-normal distributions.

**3) Tukey's Fences Method:**
- This is the IQR rule under another name: the lower fence is Q1 - k * IQR and the upper fence is Q3 + k * IQR.
- Remove rows where any column value is below the lower fence or above the upper fence (k = 1.5 is the conventional choice).
- Like the IQR method, this approach is robust to non-normal distributions.

**4) Standard Deviation Method:**
- Calculate the mean and standard deviation for each column in the DataFrame.
- Remove rows where any column value lies more than a chosen number of standard deviations from the mean (e.g., 3 standard deviations). This is equivalent to thresholding the absolute z-score.
- This method assumes a normal distribution of the data.

**5) Percentile Method:**
- Calculate the lower and upper percentiles for each column in the DataFrame (e.g., 1st and 99th percentiles).
- Remove rows where any column value is below the lower percentile or above the upper percentile.
- This method makes no distributional assumption, but it always removes a fixed fraction of rows, whether or not those rows are genuinely extreme.


It's important to note that the choice of method depends on the characteristics of your data and the specific requirements of your analysis. You may need to experiment with different methods or use a combination of approaches to effectively remove outliers from your DataFrame.
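
The percentile method from the list above can be sketched on a small made-up column. The `toy` values and the 5th/95th percentile cut-offs are assumptions chosen for illustration; they are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy data: the integers 1..20 plus one extreme value at each end
toy = pd.DataFrame({"value": [-500] + list(range(1, 21)) + [900]})

# Lower and upper percentile cut-offs (5th and 95th, chosen for illustration)
lower = toy["value"].quantile(0.05)
upper = toy["value"].quantile(0.95)

# Keep only the rows that fall inside the [lower, upper] band
trimmed = toy[toy["value"].between(lower, upper)]

print(len(toy), len(trimmed))
```

Note that the ordinary values 1 and 20 are trimmed along with -500 and 900. Percentile trimming discards a fixed share of each tail, so the cut-offs should be chosen with the dataset size in mind.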

## IQR (Interquartile Range) Method

In [0]:
df.dtypes

In [0]:
# Define the numerical columns
numerical_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                     'total_bedrooms', 'population', 'households', 'median_income',
                     'median_house_value']

# Calculate the first quartile (Q1) and third quartile (Q3) for each numerical column
Q1 = df[numerical_columns].quantile(0.25)
Q3 = df[numerical_columns].quantile(0.75)

# Calculate the interquartile range (IQR) for each numerical column
IQR = Q3 - Q1

# Define the lower and upper bounds for outlier detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out the outliers from the DataFrame
df_no_outliers = df[~((df[numerical_columns] < lower_bound) | (df[numerical_columns] > upper_bound)).any(axis=1)]

In [0]:
df_no_outliers.shape

In [0]:
# Calculate the initial row count
initial_row_count = len(df)

# Calculate the row count after outlier removal
final_row_count = len(df_no_outliers)

# Calculate the number of removed rows
removed_rows = initial_row_count - final_row_count

# Display the number of removed rows
print("Number of removed rows:", removed_rows)

## Z-Score Method

In [0]:
from scipy import stats

# Define the numerical columns
numerical_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                     'total_bedrooms', 'population', 'households', 'median_income',
                     'median_house_value']

# Calculate z-scores for numerical columns
z_scores = stats.zscore(df[numerical_columns])

# Define the threshold for outlier detection
threshold = 3

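# Keep only rows whose absolute z-score is below the threshold in every column
# (np.abs is needed so that unusually low values are flagged as well as high ones)
df_no_outliers = df[(np.abs(z_scores) < threshold).all(axis=1)]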

In [0]:
df_no_outliers.shape

In [0]:
# Calculate the initial row count
initial_row_count = len(df)

# Calculate the row count after outlier removal
final_row_count = len(df_no_outliers)

# Calculate the number of removed rows
removed_rows = initial_row_count - final_row_count

# Display the number of removed rows
print("Number of removed rows:", removed_rows)
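
A z-score can also be computed directly with pandas, which makes the two-sided nature of the test easy to see. The sketch below uses a made-up Series (`toy` is an assumption, not housing data) and compares filtering on the absolute z-score with filtering on the signed value. pandas' `std()` defaults to `ddof=1`, whereas `scipy.stats.zscore` defaults to `ddof=0`, so `ddof=0` is passed explicitly to match.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 48 ordinary values around 10, one high and one low extreme
toy = pd.Series([9, 11] * 24 + [60, -40], dtype=float)

# Population z-score (ddof=0), matching the default of scipy.stats.zscore
z = (toy - toy.mean()) / toy.std(ddof=0)

threshold = 3
kept_abs = toy[np.abs(z) < threshold]   # two-sided: drops both 60 and -40
kept_signed = toy[z < threshold]        # one-sided: drops 60 but keeps -40

print(len(kept_abs), len(kept_signed))
```

Only the absolute-value comparison removes the low-side extreme as well as the high-side one.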

# 3) Feature Creation

In [0]:
df_no_outliers.shape

In [0]:
df.shape

In [0]:
# Take an explicit copy so that adding columns below does not raise SettingWithCopyWarning
df = df_no_outliers.copy()

In [0]:
df.shape

In [0]:
df.head()

In [0]:
df["housing_median_age_days"] = df["housing_median_age"] * 365

In [0]:
df.head()

In [0]:
df = df.drop(columns="housing_median_age_days")

In [0]:
df.head()

# 4) Feature Scaling

- **Feature scaling, also known as data normalization:** The process of transforming numerical features in a dataset to a common scale. It is a crucial step in data preprocessing and feature engineering, as it brings the features to a similar range and magnitude. The goal is to ensure that no single feature dominates the learning process or introduces bias simply because its values are larger.

- **There are two common methods for feature scaling:**

**1) Standardization (Z-score normalization):** In this method, each feature is transformed to have zero mean and unit variance. The formula for standardization is: x_scaled = (x - mean) / standard_deviation.
Standardization ensures that the transformed feature has a mean of 0 and a standard deviation of 1.

**2) Min-Max scaling:** In this method, each feature is scaled to a specific range, typically between 0 and 1.
The formula for min-max scaling is: x_scaled = (x - min) / (max - min).
Min-max scaling preserves the relative ordering of values and ensures that the transformed feature is bounded within the defined range.
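
Both formulas can be checked by hand on a small array. This is a minimal sketch with made-up values. `x.std()` in NumPy is the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, divide by the (population) standard deviation
x_std = (x - x.mean()) / x.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_std.round(3))
print(x_minmax)
```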


**Feature scaling is important for several reasons:**

1) Gradient-based optimization algorithms, such as gradient descent, converge faster when features are on a similar scale, which makes model training more efficient.

2) Features with larger scales can dominate the learning process, leading to biased results. Scaling the features ensures that no single feature has undue influence on the model.

3) Many machine learning algorithms, such as K-nearest neighbors (KNN) and support vector machines (SVM), rely on calculating distances between data points. If features are not on a similar scale, features with larger values can dominate the distance calculations, leading to suboptimal results.

4) Some techniques, such as principal component analysis (PCA), are driven by feature variance, so an unscaled feature with a large range can dominate the leading components. Centering and scaling the features is needed to obtain meaningful results.
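
The distance argument in point 3 can be illustrated with three made-up points. The column meanings and values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical points: (median_income, total_rooms)
a = np.array([2.0, 1000.0])
b = np.array([9.0, 1010.0])   # very different income, similar room count
c = np.array([2.1, 1400.0])   # similar income, different room count

def euclidean(p, q):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw distances: total_rooms dominates because its values are much larger
d_ab_raw, d_ac_raw = euclidean(a, b), euclidean(a, c)
print(d_ab_raw, d_ac_raw)

# Standardize each column (zero mean, unit variance) and recompute
points = np.vstack([a, b, c])
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])
print(d_ab_scaled, d_ac_scaled)
```

On the raw values, `b` looks far closer to `a` than `c` does, purely because room counts are numerically large. After standardization the two distances are of similar size, so the income difference is no longer ignored.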

In [0]:
df.columns

In [0]:
numerical_features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value']

In [0]:
print(numerical_features)

In [0]:
df.head()

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

In [0]:
df.head()

# 5) One-hot encoding (Feature Encoding)

One-hot encoding is a form of feature encoding: it converts a categorical variable into a set of binary indicator columns, one per category, so that the variable can be used by machine learning algorithms. It is a common preprocessing step in machine learning tasks that involve categorical features.

Categorical variables are variables that represent qualitative or discrete characteristics or groups. Examples of categorical variables include "color" (red, green, blue), "city" (New York, London, Paris), or "animal" (cat, dog, bird).

**The benefits of one-hot encoding include:**

1) Compatibility with machine learning algorithms: Many machine learning algorithms require numerical input. By converting categorical variables into a numerical format, one-hot encoding enables the use of these variables in machine learning models.

2) Preserving information: One-hot encoding preserves the information about the presence or absence of specific categories in the original data, which can be valuable for certain models.
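
A minimal sketch of what the encoding produces, using a made-up `color` column (an assumption, not part of the housing data). It also shows the effect of `drop_first=True`.

```python
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["color"], dtype=int)

# drop_first=True drops the first category in sorted order ("blue" here),
# which becomes the implicit baseline when all remaining indicators are 0
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column avoids perfectly collinear indicators, which matters for models such as unregularized linear regression.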

In [0]:
df.select_dtypes(include="object").columns

In [0]:
df['ocean_proximity'].unique()

In [0]:
df['ocean_proximity'].nunique()

In [0]:
df.head()

In [0]:
df.shape

In [0]:
df = pd.get_dummies(data=df, drop_first=True)

In [0]:
df.head()

In [0]:
df.shape

# 6) Feature Selection


Feature selection is a crucial step in feature engineering, where the goal is to identify and select a subset of relevant features from the available set of features in a dataset. The aim is to improve model performance, reduce overfitting, enhance interpretability, and reduce computational complexity.

- **Benefits of feature selection include:**

1) Improved model performance: By selecting relevant features, feature selection can enhance model accuracy, reduce overfitting, and improve generalization on unseen data.

2) Faster model training: Fewer features can lead to faster training times, especially when dealing with large datasets or complex models.

3) Enhanced interpretability: Selecting a subset of meaningful features can improve the interpretability of the model, allowing for better understanding and insights.

4) Reduced dimensionality: By eliminating irrelevant or redundant features, feature selection can reduce the dimensionality of the dataset, making it more manageable and reducing the risk of the curse of dimensionality.

In [0]:
df.head()

In [0]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

In [0]:
X.head()

In [0]:
y.head()

In [0]:
type(y)

In [0]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Create the feature selection model (linear regression estimator and 5 features to select)
estimator = LinearRegression()
rfe = RFE(estimator, n_features_to_select=5)

# Fit the feature selection model on the data
rfe.fit(X, y)

# Get the selected features
selected_features = X.columns[rfe.support_]
print(selected_features)
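
Besides `support_`, a fitted `RFE` object exposes `ranking_`, which shows the order in which features were eliminated. The sketch below uses synthetic data (the arrays `X_toy` and `y_toy` and their coefficients are assumptions for illustration) in which only the first two of four features drive the target.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: only the first two of four features influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.1 * rng.normal(size=200)

rfe_toy = RFE(LinearRegression(), n_features_to_select=2)
rfe_toy.fit(X_toy, y_toy)

# support_ is a boolean mask of the kept features; ranking_ is 1 for kept
# features and grows for features that were eliminated earlier
print(rfe_toy.support_)
print(rfe_toy.ranking_)
```

With a linear estimator, RFE ranks features by coefficient magnitude, so the result depends on the features being on comparable scales.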

# 7) Feature Transformation (if needed)

- Feature transformation is the process of applying mathematical or statistical transformations to the existing features in a dataset to make them more suitable for a machine learning algorithm or to reveal underlying patterns in the data.
- Feature transformation techniques aim to improve the quality and representativeness of the features, which can lead to better model performance and more meaningful insights.

In [0]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from scipy.stats import boxcox

# Example dataset
data = pd.DataFrame({
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [0.1, 1, 10, 100, 1000],
    'feature3': [100, 200, 300, 400, 500]
})

In [0]:
print(data)

In [0]:
# Normalization
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

# Standardization
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

# Logarithmic Transformation
log_transformed_feature = np.log(data['feature1'])

# Power Transformation
power_transformed_feature = np.sqrt(data['feature2'])

# Box-Cox Transformation
boxcox_transformed_feature, _ = boxcox(data['feature3'])

# Binning
bin_edges = [0, 20, 40, 60]
binned_feature = pd.cut(data['feature1'], bins=bin_edges, labels=False)

# Polynomial Transformation
polynomial_features = pd.DataFrame({
    'feature1_squared': data['feature1'] ** 2,
    'feature1_cubed': data['feature1'] ** 3
})

# Interaction Terms
interaction_feature = data['feature1'] * data['feature2']

In [0]:
# Print the transformed features
print("Normalized data:")
print(normalized_data)

print("\nStandardized data:")
print(standardized_data)

print("\nLogarithmic transformed feature:")
print(log_transformed_feature)

print("\nPower transformed feature:")
print(power_transformed_feature)

print("\nBox-Cox transformed feature:")
print(boxcox_transformed_feature)

print("\nBinned feature:")
print(binned_feature)

print("\nPolynomial features:")
print(polynomial_features)

print("\nInteraction feature:")
print(interaction_feature)
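
Polynomial and interaction terms can also be generated in one step with scikit-learn's `PolynomialFeatures`, as an alternative to building them by hand. This is a minimal sketch on a made-up two-column frame (`toy` is an assumption for illustration).

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

toy = pd.DataFrame({"feature1": [1, 2, 3], "feature2": [10, 20, 30]})

# degree=2 without the constant column yields: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(toy)

print(expanded.shape)
print(expanded[0])
```

The number of generated columns grows quickly with the degree and the number of input features, so this is usually applied to a small, selected subset.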

# 8) Dimensionality Reduction (if needed)

- The process of reducing the number of features or variables in a dataset while preserving the essential information.
- **It aims to:**
1) Mitigate the curse of dimensionality
2) Improve computational efficiency
3) Eliminate noisy or redundant features
4) Potentially enhance the performance of ML models

- High-dimensional data can lead to several challenges, such as increased computational complexity, overfitting, and difficulty in interpreting and visualizing the data.
- Dimensionality reduction techniques address these challenges by transforming or projecting the data into a lower-dimensional space, where the most relevant information is retained.

**There are two main approaches to dimensionality reduction:**

**1) Feature Selection:** This approach involves selecting a subset of the original features based on certain criteria. It aims to identify the most informative and relevant features that contribute significantly to the target variable or capture the underlying patterns in the data. Feature selection methods can be filter-based (e.g., correlation, statistical tests) or wrapper-based (e.g., recursive feature elimination, forward/backward feature selection).

**2) Feature Extraction:** This approach involves transforming the original features into a new set of lower-dimensional features. It aims to create a compressed representation of the data by combining or projecting the original features into a new feature space. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are popular feature extraction techniques. Other methods include Non-negative Matrix Factorization (NMF), t-SNE, and Autoencoders.
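
As an illustration of the filter-based selection mentioned under approach 1, the sketch below scores each feature by its absolute Pearson correlation with the target and keeps the top two. The column names, noise levels, and `k = 2` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)

# Hypothetical features with decreasing relevance to the target
toy = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),   # closely tracks the target
    "weak": signal + 2.0 * rng.normal(size=n),     # loosely tracks the target
    "noise": rng.normal(size=n),                   # unrelated to the target
})
target = pd.Series(signal, name="target")

# Absolute Pearson correlation of each feature with the target, highest first
scores = toy.corrwith(target).abs().sort_values(ascending=False)
top_k = list(scores.index[:2])
print(top_k)
```

A correlation filter only captures linear, one-feature-at-a-time relationships, which is why wrapper methods such as recursive feature elimination are often used alongside it.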

**Benefits of dimensionality reduction:**

1) Computational Efficiency: By reducing the number of features, the computational complexity of algorithms decreases, resulting in faster training and inference times.

2) Overfitting Prevention: Dimensionality reduction helps to remove noisy or irrelevant features, reducing the risk of overfitting and improving the generalization capability of models.

3) Improved Visualization: Lower-dimensional data can be visualized more easily, enabling better understanding and interpretation of the data.

4) Enhanced Model Performance: By focusing on the most relevant features, dimensionality reduction can improve the performance of machine learning models by reducing noise, capturing important patterns, and avoiding the curse of dimensionality.

**Principal Component Analysis (PCA)** is a widely used technique for dimensionality reduction and feature extraction in data analysis and machine learning. It projects a high-dimensional dataset onto a small number of orthogonal directions (the principal components) that capture as much of the variance in the data as possible.

In [0]:
from sklearn.decomposition import PCA

# Create the PCA model
pca = PCA(n_components=2)

# Fit the PCA model to X
pca.fit(X)

# Transform X to the new feature space
X_reduced = pca.transform(X)

# Print the shape of X_reduced
print(X_reduced.shape)

In [0]:
# Print the number of principal components
print(pca.n_components_)

In [0]:
print(X_reduced)
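
To judge how much information the retained components keep, a fitted `PCA` object exposes `explained_variance_ratio_`. The sketch below uses synthetic data (`X_toy` is an assumption for illustration) in which two columns are almost perfectly correlated, so one component captures nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: column 2 is nearly a multiple of column 1, column 3 is small noise
rng = np.random.default_rng(0)
base = rng.normal(size=300)
X_toy = np.column_stack([
    base,
    2.0 * base + 0.1 * rng.normal(size=300),
    0.1 * rng.normal(size=300),
])

pca_toy = PCA(n_components=2).fit(X_toy)

# Fraction of total variance captured by each component, and the running total
evr = pca_toy.explained_variance_ratio_
print(evr)
print(evr.cumsum())
```

A common way to choose `n_components` is to keep the smallest number of components whose cumulative ratio passes a chosen threshold.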