
Exercises XP Ninja

Last Updated: October 16th, 2024

What you will learn

    Perform complex data cleaning and feature engineering on a real-world dataset.
    Integrate and transform multiple datasets for comprehensive analysis.
    Explore and compare dimensionality reduction techniques.


What you will create

    A cleaned and enhanced version of the New York City Airbnb dataset with missing values handled using advanced imputation techniques.
    An exploratory data analysis report showcasing the correlations between the newly created features and the target variable.
    Application of PCA for dimensionality reduction, preserving critical information.
    Application of an additional dimensionality reduction technique, such as t-SNE or LDA, with a comparison to PCA results.


Exercise 1: Advanced Data Cleaning and Feature Engineering
Instructions

Dataset: Use the New York City Airbnb Open Data.

    Load the NYC Airbnb dataset.
    Identify and handle missing values in multiple columns using advanced imputation techniques.
    Detect and treat outliers in key columns like ‘price’ and ‘number_of_reviews’.
    Create new features based on existing data, such as ‘booking_rate’ (number of reviews divided by availability) and ‘price_per_person’ (price divided by the number of accommodated guests).
    Perform exploratory data analysis to understand correlations between newly created features and the target variable.

Hint: For advanced data cleaning techniques, refer to this article on Data Cleaning.
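
As one possible reading of "advanced imputation" and "treat outliers", here is a minimal sketch using scikit-learn's KNNImputer followed by IQR-based clipping. The toy values, the helper names `impute_knn` and `clip_iqr`, and the choice of `n_neighbors=2` and `k=1.5` are assumptions for illustration, not part of the exercise. On the real data it is usually advisable to scale the columns before KNN imputation, since distances are otherwise dominated by the column with the largest range.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for a few numeric Airbnb columns (values are made up).
toy = pd.DataFrame({
    "price": [80.0, 120.0, 95.0, 10000.0, 110.0, 70.0],
    "number_of_reviews": [12.0, 3.0, 40.0, 1.0, 8.0, 25.0],
    "reviews_per_month": [1.2, np.nan, 3.5, 0.1, np.nan, 2.0],
})


def impute_knn(frame, n_neighbors=2):
    """Fill each NaN with the mean of that column over the nearest rows."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    filled = imputer.fit_transform(frame)
    return pd.DataFrame(filled, columns=frame.columns, index=frame.index)


def clip_iqr(series, k=1.5):
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


imputed = impute_knn(toy)
imputed["price"] = clip_iqr(imputed["price"])
print(imputed)
```

Clipping keeps every row and only caps extreme values, whereas filtering (for example `price < 5000`) drops rows; which one fits depends on whether the extreme listings are errors or genuine luxury rentals.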


Exercise 2: Complex Data Integration and Transformation
Instructions

Datasets: World Happiness Report and Global Health and Population Statistics

    Load both datasets.
    Perform data integration by merging the two datasets on the ‘Country’ column.
    Transform the integrated dataset by normalizing numerical columns like ‘GDP per Capita’ and ‘Life Expectancy’.
    Apply PCA for dimensionality reduction while preserving significant information.
    Conduct a comparative analysis pre- and post-transformation to evaluate the impact of these processes on the data.

Hint: For guidance on data integration and transformation, check out this article on Data Transformation.
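
The merge, normalization, and PCA steps above can be sketched on two tiny made-up tables. The country labels, the values, and every column other than 'Country', 'GDP per Capita', and 'Life Expectancy' are assumptions; the real files may name their columns differently, so check them after loading.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the two datasets.
happiness = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Score": [7.1, 5.4, 6.3, 4.2],
    "GDP per Capita": [1.4, 0.9, 1.1, 0.5],
})
health = pd.DataFrame({
    "Country": ["A", "B", "C", "E"],
    "Life Expectancy": [81.0, 72.5, 77.0, 65.0],
    "Population": [5.0, 40.0, 12.0, 3.0],
})

# Inner join: keep only countries present in both tables.
merged = happiness.merge(health, on="Country", how="inner")

# Standardize numeric columns to zero mean and unit variance.
numeric_cols = merged.select_dtypes("number").columns
scaled = StandardScaler().fit_transform(merged[numeric_cols])

# Project onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(merged["Country"].tolist())
print(pca.explained_variance_ratio_)
```

For the comparative analysis, one option is to compare `merged[numeric_cols].describe()` with the same summary of the scaled data, and to report how much variance the retained components explain.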


Exercise 3: Exploring Dimensionality Reduction Techniques
Instructions

    Read this article: A Complete Guide On Dimensionality Reduction

    Import this dataset: Shop Customer Data
    Implement Principal Component Analysis (PCA) and observe how much variance is retained with different numbers of components.
    Apply at least one more dimensionality reduction technique (like t-SNE or LDA) and compare its results with PCA.
    Visualize the results of these techniques using plots.
    Write a brief analysis of how dimensionality reduction impacted the dataset and the insights you gained from the visualizations.
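
A minimal sketch of the PCA and t-SNE steps, run on synthetic data because the column layout of Shop Customer Data is not given here. The two-group structure, the sample size, and `perplexity=10` are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 "customers", 5 numeric features, two loose groups.
group_a = rng.normal(0.0, 1.0, size=(30, 5))
group_b = rng.normal(3.0, 1.0, size=(30, 5))
X = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} components retain {share:.1%} of the variance")

# Two-dimensional embeddings to compare side by side.
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```

Both `pca_2d` and `tsne_2d` can be passed to a scatter plot for the visualization step. Keep in mind that PCA axes are linear combinations of the original features, while t-SNE distances are only meaningful locally, and that LDA, unlike these two, needs class labels.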




In [None]:
import pandas as pd

# load the data
df = pd.read_csv("AB_NYC_2019.csv")

# keep the useful columns (only those that exist in the file)
cols = ["price", "number_of_reviews", "availability_365", "reviews_per_month", "minimum_nights", "calculated_host_listings_count", "room_type", "accommodates"]
df = df[[col for col in cols if col in df.columns]].copy()

# drop rows missing the key values
df = df.dropna(subset=["price", "number_of_reviews"])

# replace NaN with 0 if the column is available
if "reviews_per_month" in df.columns:
    df["reviews_per_month"] = df["reviews_per_month"].fillna(0)

# remove price outliers (copy so later assignments do not hit a view)
df = df[df["price"] < 5000].copy()

# avoid division by zero
df["availability_365"] = df["availability_365"].replace(0, 1)

# create booking_rate
df["booking_rate"] = df["number_of_reviews"] / df["availability_365"]

# correlation of booking_rate with its source columns
cor1 = df["booking_rate"].corr(df["number_of_reviews"])
cor2 = df["booking_rate"].corr(df["availability_365"])
print("booking_rate vs number_of_reviews :", round(cor1, 3))
print("booking_rate vs availability_365 :", round(cor2, 3))

# price_per_person needs "accommodates", which was kept above only if the
# file contains it; without this guard the lookup raises a KeyError
if "accommodates" in df.columns:
    df["accommodates"] = df["accommodates"].replace(0, 1)
    df["price_per_person"] = df["price"] / df["accommodates"]

    # correlation of price_per_person with its source columns
    cor3 = df["price_per_person"].corr(df["price"])
    cor4 = df["price_per_person"].corr(df["accommodates"])
    print("price_per_person vs price :", round(cor3, 3))
    print("price_per_person vs accommodates :", round(cor4, 3))
else:
    print("price_per_person skipped: no 'accommodates' column in this file")
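
The exercise also asks for correlations between the new features and the target variable. The exercise text does not name the target; the sketch below assumes it is 'price' and uses made-up values, so treat it as an illustration of `DataFrame.corrwith` and not as the expected answer.

```python
import pandas as pd

# Made-up values standing in for a few cleaned listings.
toy = pd.DataFrame({
    "price": [80, 120, 95, 300, 110, 70],
    "number_of_reviews": [12, 3, 40, 1, 8, 25],
    "availability_365": [200, 365, 90, 300, 150, 30],
})
toy["booking_rate"] = toy["number_of_reviews"] / toy["availability_365"]

# Assumption: price is the target variable.
target = "price"

# Pearson correlation of every other column with the target.
correlations = toy.drop(columns=[target]).corrwith(toy[target]).sort_values()
print(correlations.round(3))
```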