# Topics:
- Pandas dtypes and conversions
- Frequency distributions for categorical variables
- Summary statistics, skewness, and kurtosis
- Univariate analysis: histograms, box plots
- Bivariate analysis: scatter plots, box plots
- Multivariate analysis: scatter matrix, heatmap
- Binning and transformation: equal-width, equal-frequency, and custom bins


### Pandas dtypes & Conversions

In [None]:
import pandas as pd
import numpy as np
pd.__version__


id_int64   = pd.Series([101, 102, None, 104], dtype="Int64")          # nullable integer
price_flt  = pd.Series([19.99, 25.50, np.nan, 10.00], dtype="float64")# float
name_obj   = pd.Series(["alice", "BOB", "cArOl", "dave"], dtype="object")  # object (strings)
is_member  = pd.Series([True, False, True, False], dtype="bool")      # bool
joined_at  = pd.to_datetime(["2025-09-18", "2025-09-19", None, "2025-09-21"])  # datetime64[ns]
wait_time  = pd.to_timedelta(["1 days", "0 days 06:30:00", None, "2 days 00:00:00"])  # timedelta64[ns]
tier_cat   = pd.Series(["Silver", "Gold", "Bronze", "Gold"], dtype="category")  # category

df = pd.DataFrame({
    "id": id_int64,
    "price": price_flt,
    "name": name_obj,
    "is_member": is_member,
    "joined_at": joined_at,
    "wait_time": wait_time,
    "tier": tier_cat
})


df

In [None]:
# Print the existing data types
dtypes_out = df.dtypes
print(dtypes_out)

In [None]:
# Example conversions on the existing df

# Convert 'price' column to nullable integer
# Round explicitly before casting so the handling of fractional values is
# deliberate rather than implicit; NaN becomes <NA>
df['price_int'] = df['price'].round().astype('Int64')
print("DataFrame after converting 'price' to Int64:")
display(df[['price', 'price_int']])

# Convert 'id' column to float
df['id_float'] = df['id'].astype('float64')
print("\nDataFrame after converting 'id' to float64:")
display(df[['id', 'id_float']])

# Convert 'is_member' to object
df['is_member_obj'] = df['is_member'].astype('object')
print("\nDataFrame after converting 'is_member' to object:")
display(df[['is_member', 'is_member_obj']])

# Convert 'tier' to object
df['tier_obj'] = df['tier'].astype('object')
print("\nDataFrame after converting 'tier' to object:")
display(df[['tier', 'tier_obj']])

# Convert 'name' to category
df['name_cat'] = df['name'].astype('category')
print("\nDataFrame after converting 'name' to category:")
display(df[['name', 'name_cat']])

# Convert 'joined_at' to object
df['joined_at_obj'] = df['joined_at'].astype('object')
print("\nDataFrame after converting 'joined_at' to object:")
display(df[['joined_at', 'joined_at_obj']])

# Convert 'wait_time' to object
df['wait_time_obj'] = df['wait_time'].astype('object')
print("\nDataFrame after converting 'wait_time' to object:")
display(df[['wait_time', 'wait_time_obj']])

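# Example using pd.to_numeric on the 'id' column with errors='coerce'
# ('id' is already numeric, so nothing is coerced here; errors='coerce' matters
# for columns that contain unparseable strings)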
id_numeric_coerced = pd.to_numeric(df['id'], errors='coerce')
print("\n'id' column after using pd.to_numeric with errors='coerce':")
display(id_numeric_coerced)
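
The effect of `errors='coerce'` is easier to see on a column that actually contains invalid entries. The following is a minimal sketch on a hypothetical `raw_ids` series (not part of the notebook's data): unparseable strings become `NaN`, and a follow-up cast to `Int64` turns them into `<NA>`.

```python
import pandas as pd

# Hypothetical raw column with one entry that cannot be parsed as a number
raw_ids = pd.Series(["101", "102", "abc", "104"])

# errors='coerce' replaces unparseable values with NaN instead of raising
ids_numeric = pd.to_numeric(raw_ids, errors="coerce")

# Cast to the nullable integer dtype so the missing value is shown as <NA>
ids_int = ids_numeric.astype("Int64")

print(ids_numeric)
print(ids_int)
```

Counting `ids_numeric.isna().sum()` before and after the conversion is a quick way to check how many values were lost to coercion.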

## Load housing data from Kaggle into a pandas DataFrame.
## Examine the data types of the columns in the loaded DataFrame.


In [None]:
import pandas as pd

# Load the Housingkaggle.csv file into a DataFrame
df_housing = pd.read_csv('Housingkaggle.csv')

# Display the first few rows to verify loading
print("First 5 rows of Housingkaggle.csv:")
display(df_housing.head())

# Examine the data types of the columns
print("\nData types of columns in df_housing:")
display(df_housing.dtypes)

## Analyze data distribution

### Generate visualizations or statistics to understand the distribution of data in relevant columns (e.g., numerical and categorical columns).


In [None]:
# Analyze frequency distribution of categorical columns in df_housing

print("Frequency Distribution Analysis for Categorical Columns:")

# Select categorical columns from df_housing (object type in this case)
categorical_cols = df_housing.select_dtypes(include='object').columns

for col in categorical_cols:
    print(f"\n--- Frequency Distribution for '{col}' ---")
    # Calculate and display value counts
    value_counts = df_housing[col].value_counts()
    display(value_counts)

    # Interpret the results based on the value counts:
    # - The mode is the category with the highest count (first in the value_counts output)
    # - Compare the count of the mode to other categories to see if there are other frequent categories
    # - Observe how quickly the counts decrease to understand the distribution shape (rapidly decreasing, equal, etc.)
    # - Look for categories with very low counts (rare categories)

    # Optional: Calculate percentages
    # value_counts_pct = df_housing[col].value_counts(normalize=True) * 100
    # print("Frequency Distribution (%) :")
    # display(value_counts_pct)
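
The interpretation steps listed in the comments above can be sketched on a small, made-up column. The `status` series and the `freq_table` name are illustrative assumptions; the idea is simply to put counts and percentages side by side and read off the mode.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical feature
status = pd.Series([
    "furnished", "semi-furnished", "unfurnished",
    "semi-furnished", "unfurnished", "semi-furnished",
])

# One table holding both absolute and relative frequencies
freq_table = pd.DataFrame({
    "count": status.value_counts(),
    "percent": (status.value_counts(normalize=True) * 100).round(1),
})

# The mode is the category with the highest count
mode_category = freq_table["count"].idxmax()

print(freq_table)
print("Mode:", mode_category)
```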

In [None]:
# Categorical variables:
# a bar chart is usually easier to read than a pie chart

import matplotlib.pyplot as plt
main_road_counts = df_housing['mainroad'].value_counts()
main_road_perc = df_housing['mainroad'].value_counts(normalize=True) * 100
main_road_perc.plot(kind='bar', color='skyblue')
plt.title('Distribution of mainroad access')
plt.xlabel('Category')
plt.ylabel('Percentage (%)')  # the plotted values are percentages, not counts
plt.xticks(rotation=0)  # Keeps the category names horizontal
plt.ylim(0, 100)
plt.show()

In [None]:
# Calculate and display descriptive statistics for numerical columns in df_housing
print("Descriptive Statistics for Numerical Columns in df_housing:")
desc = df_housing.describe()
missing = df_housing.isna().sum().to_frame("missing_count")
display(desc)
display(missing)

In [None]:
# Univariate analysis
# Numerical column: show descriptive statistics, a histogram, and a box plot
display(df_housing['price'].describe())

# Draw each plot on its own figure so the box plot does not overlay the histogram
df_housing['price'].hist(grid=False)
plt.show()
df_housing['price'].plot(kind='box')
plt.show()
# Alternative:
# df_housing.boxplot(column='price')

In [None]:
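# piplite exists only in JupyterLite (Pyodide) kernels;
# in a regular Jupyter environment, use `%pip install seaborn` instead
import piplite
await piplite.install('seaborn')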

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Test it
sns.scatterplot(x=[1, 2, 3], y=[4, 5, 6])
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# --- Bivariate Analysis ---

print("--- Bivariate Analysis ---")

# Scatter plot: price vs area
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_housing, x='area', y='price')
plt.title('Price vs Area')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()

# Boxplot: price vs furnishingstatus
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_housing, x='furnishingstatus', y='price')
plt.title('Price vs Furnishing Status')
plt.xlabel('Furnishing Status')
plt.ylabel('Price')
plt.show()

# Boxplot: price vs stories
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_housing, x='stories', y='price')
plt.title('Price vs Number of Stories')
plt.xlabel('Number of Stories')
plt.ylabel('Price')
plt.show()


In [None]:
# --- Multivariate Analysis ---

print("\n--- Multivariate Analysis ---")

# Pairplot for a subset of numerical variables to see pairwise relationships
# Selecting a few relevant numerical columns to avoid overwhelming the plot
numerical_subset = ['price', 'area', 'bedrooms', 'bathrooms']
sns.pairplot(df_housing[numerical_subset])
plt.suptitle('Pairwise Relationships of Numerical Features', y=1.02)
plt.show()

# Heatmap of correlations between numerical variables
plt.figure(figsize=(10, 8))
correlation_matrix = df_housing[numerical_subset].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()

# --- Histograms (revisiting for completeness, though done earlier) ---

print("\n--- Histograms (Individual Distributions) ---")

# Select numerical columns from df_housing
numerical_cols = df_housing.select_dtypes(include=np.number).columns

# Generate histograms for numerical columns
for col in numerical_cols:
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df_housing, x=col, kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

In [None]:
# Revisit skewness and kurtosis and compare them to a Gaussian distribution

print("Skewness and Kurtosis compared to a Gaussian (Normal) Distribution for df_housing:")
print("A perfect Gaussian distribution has skewness = 0 and excess kurtosis = 0.")
print("pandas' kurtosis() reports excess kurtosis (Fisher's definition: kurtosis minus 3).")
print("A mesokurtic distribution has kurtosis similar to that of a normal distribution.")
# Select numerical columns from df_housing
numerical_cols = df_housing.select_dtypes(include=np.number).columns

for col in numerical_cols:
    skewness = df_housing[col].skew()
    kurtosis = df_housing[col].kurtosis()

    print(f"\n--- Statistics for '{col}' ---")
    print(f"Skewness: {skewness:.4f} (Deviation from 0: {abs(skewness):.4f})")
    if abs(skewness) > 0.5:  # 0.5 is a rule-of-thumb cutoff, not a statistical test
        print("  (Noticeable deviation from symmetrical)")
    else:
        print("  (Relatively close to symmetrical)")

    print(f"Kurtosis: {kurtosis:.4f} (Deviation from 0: {abs(kurtosis):.4f})")
    if abs(kurtosis) > 0.5:
        print("  (Noticeable deviation from mesokurtic)")
    else:
        print("  (Relatively close to mesokurtic)")

# You can also visually inspect the histograms generated earlier
# to see how the shapes compare to a bell curve.
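
To get a feel for what these numbers look like on known shapes, the sketch below compares two synthetic samples: one drawn from a normal distribution and one from an exponential distribution. The samples, the seed, and the variable names are assumptions made for illustration and are unrelated to the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Symmetric reference sample and a strongly right-skewed sample
normal_sample = pd.Series(rng.normal(loc=0, scale=1, size=10_000))
skewed_sample = pd.Series(rng.exponential(scale=1.0, size=10_000))

for name, s in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # Series.kurt() returns excess kurtosis, so a normal sample lands near 0
    print(f"{name}: skew={s.skew():.2f}, excess kurtosis={s.kurt():.2f}")
```

The normal sample should land close to zero on both measures, while the exponential sample should show clearly positive skewness and excess kurtosis.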

In [None]:
import pandas as pd
import numpy as np

# Example using pd.cut() on the 'price' column
# Binning into 4 fixed-width intervals
print("--- Using pd.cut() on 'price' (4 fixed-width bins) ---")
df_housing['price_bin_cut'] = pd.cut(df_housing['price'], bins=4)
display(df_housing[['price', 'price_bin_cut']].head())
print("\nValue counts for price_bin_cut:")
display(df_housing['price_bin_cut'].value_counts())

# Example using pd.qcut() on the 'area' column
# Binning into 4 quantiles (approximately equal number of observations per bin)
print("\n--- Using pd.qcut() on 'area' (4 quantiles) ---")
df_housing['area_bin_qcut'] = pd.qcut(df_housing['area'], q=4)
display(df_housing[['area', 'area_bin_qcut']].head())
print("\nValue counts for area_bin_qcut:")
display(df_housing['area_bin_qcut'].value_counts())

# You can also specify custom bin edges for pd.cut()
# custom_bins = [0, 3000000, 6000000, 9000000, df_housing['price'].max()]
# df_housing['price_bin_custom'] = pd.cut(df_housing['price'], bins=custom_bins, include_lowest=True)
# print("\n--- Using pd.cut() with custom bins on 'price' ---")
# display(df_housing[['price', 'price_bin_custom']].head())
# print("\nValue counts for price_bin_custom:")
# display(df_housing['price_bin_custom'].value_counts())
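
Custom bins are often easier to read when each interval carries a label. The following is a small sketch on hypothetical prices; the bin edges, the label names, and the open-ended top bin are illustrative choices, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical prices; edges and labels below are illustrative only
prices = pd.Series([1_750_000, 3_000_000, 4_200_000,
                    6_500_000, 9_100_000, 13_300_000])

# An infinite upper edge keeps the top bin open-ended
custom_bins = [0, 3_000_000, 6_000_000, 9_000_000, float("inf")]
labels = ["low", "medium", "high", "premium"]

# right=True (the default) closes each bin on the right: (0, 3M], (3M, 6M], ...
price_band = pd.cut(prices, bins=custom_bins, labels=labels)

print(price_band.value_counts(sort=False))
```

Because the bins are closed on the right, a price of exactly 3,000,000 falls into the "low" band rather than "medium".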

## Summarize findings

### Provide a brief summary of the data types and distributions observed.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Select numerical columns from df_housing
numerical_cols = df_housing.select_dtypes(include=np.number).columns

# Generate histograms for numerical columns
for col in numerical_cols:
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df_housing, x=col, kde=True)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# Select categorical columns from df_housing (object type in this case)
categorical_cols = df_housing.select_dtypes(include='object').columns

# Generate count plots for categorical columns
for col in categorical_cols:
    plt.figure(figsize=(8, 5))
    sns.countplot(data=df_housing, x=col, order=df_housing[col].value_counts().index) # Order by frequency
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()

Data Types in df_housing:
The df_housing DataFrame contains the following column data types:
- Numerical columns (int64): 'price', 'area', 'bedrooms', 'bathrooms', 'stories', 'parking'.
- Categorical columns (object): 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea', 'furnishingstatus'.


Numerical Column Distributions in df_housing:
Based on the generated histograms with KDE:
- 'price': Appears right-skewed, with most houses in the lower price range.
- 'area': Appears right-skewed, with most properties having smaller areas.
- 'bedrooms', 'bathrooms', 'stories', 'parking': These columns represent counts and show distributions with peaks at lower values, trailing off towards higher counts.
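
A common way to handle right-skewed columns such as these is a log transformation. The sketch below uses synthetic lognormal values as a stand-in for a price-like column (an assumption, not the housing data) and compares skewness before and after applying `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily

# Synthetic right-skewed values standing in for a price-like column
values = pd.Series(rng.lognormal(mean=15, sigma=0.5, size=5_000))

# log1p computes log(1 + x), which is also safe for zeros
log_values = np.log1p(values)

print(f"skew before: {values.skew():.2f}")
print(f"skew after:  {log_values.skew():.2f}")
```

Whether a transformation like this is appropriate for `price` or `area` depends on the downstream analysis, so treat it as one option to try rather than a required step.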


Categorical Column Distributions in df_housing:
Based on the count plots:
- 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea': These are binary (yes/no) columns. Most houses have 'mainroad' access, while 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', and 'prefarea' are less common.
- 'furnishingstatus': Shows the distribution across 'furnished', 'semi-furnished', and 'unfurnished' categories.


Overall Summary of Data Types and Distributions in df_housing:
The df_housing dataset contains six numerical columns (all int64) and seven categorical columns (all object). 'price' and 'area' are right-skewed, while the count-based columns peak at low values. Most of the categorical columns are binary: 'mainroad' access is common, whereas features such as 'airconditioning' and 'guestroom' are less frequent. 'furnishingstatus' has three categories: 'furnished', 'semi-furnished', and 'unfurnished'.
