# Practical Example: Descriptive Statistics

This notebook works through the solutions to the descriptive statistics practical example from the Udemy course, using the `365RE` sheet of the accompanying Excel workbook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set(style='whitegrid')

# Load data; header=4 means the column names are read from the fifth row of the sheet (0-indexed row 4)
file_path = '2.13.Practical-example.Descriptive-statistics-exercise-solution.xlsx'
df = pd.read_excel(file_path, sheet_name='365RE', header=4)

# Display first few rows
df.head()

## Task 1
**What are the types of variables and their levels of measurement?**

| Variable | Type of data | Level of measurement | Comment |
|---|---|---|---|
| ID | Categorical | Nominal | Unique identifier |
| Building | Categorical | Nominal | Building identifier |
| Year of sale | Numerical | Interval | Year of transaction |
| Month of sale | Numerical | Interval | Month of transaction |
| Type of property | Categorical | Nominal | Apartment, Office, etc. |
| Property # | Categorical | Nominal | Property identifier |
| Area (ft.) | Numerical | Ratio | Area in square feet |
| Price | Numerical | Ratio | Price in USD |
| Status | Categorical | Nominal | Sold or Not Sold |
| Customer ID | Categorical | Nominal | Customer identifier |
| Entity | Categorical | Nominal | Individual or Firm |
| Name | Categorical | Nominal | Customer Name |
| Surname | Categorical | Nominal | Customer Surname |
| Age at time of purchase | Numerical | Ratio | Age in years |
| Interval | Numerical | Ratio | Time interval |
| Y | Numerical | Ratio | Years |
| M | Numerical | Ratio | Months |
| D | Numerical | Ratio | Days |
| Gender | Categorical | Nominal | Gender |
| Country | Categorical | Nominal | Country of origin |
| State | Categorical | Nominal | State |
| Purpose | Categorical | Nominal | Purpose of purchase |
| Deal satisfaction | Categorical | Ordinal | Satisfaction rating (1-5) |
| Mortgage | Categorical | Nominal | Yes/No |
| Source | Categorical | Nominal | Source of lead |
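
The classification above is a judgment about meaning, not something pandas can infer: a column stored as integers may still be nominal or ordinal. The sketch below illustrates this on a small made-up frame (the `toy` values are hypothetical and only borrow a few column names from the dataset), and shows one possible way to encode an ordinal rating explicitly.

```python
import pandas as pd

# Hypothetical toy frame standing in for a few columns of the dataset;
# the values are invented for illustration only.
toy = pd.DataFrame({
    'Building': [1, 2, 2, 3],
    'Area (ft.)': [743.09, 756.21, 587.28, 1604.75],
    'Status': ['Sold', 'Sold', 'Not Sold', 'Sold'],
    'Deal satisfaction': [5, 5, 1, 3],
})

# The dtype says how values are stored, not how they should be measured:
# 'Building' is stored as an integer but is a nominal label.
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'unique values': toy.nunique(),
})
print(summary)

# Encode the ordinal rating as an ordered categorical so that comparisons
# (min, max, sorting) respect the order of the levels.
rating = pd.Categorical(toy['Deal satisfaction'],
                        categories=[1, 2, 3, 4, 5], ordered=True)
print(rating.min(), rating.max())
```

Keeping the rating as an ordered categorical is one design choice among several; it mainly guards against treating the levels as ratio-scale numbers by accident.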

## Task 2
**Create a frequency distribution graph (histogram) of Price with the highest possible number of bins (267).**

In [None]:
plt.figure(figsize=(12, 6))
plt.hist(df['Price'], bins=267, color='skyblue', edgecolor='black')
plt.title('Histogram of Price (267 bins)')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

## Task 3
**Create a histogram of Price with bin width of $100,000.**

In [None]:
# Calculate bins
min_price = df['Price'].min()
max_price = df['Price'].max()
bins = np.arange(min_price, max_price + 100000, 100000)

plt.figure(figsize=(12, 6))
plt.hist(df['Price'], bins=bins, color='salmon', edgecolor='black')
plt.title('Histogram of Price (Bin width $100,000)')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.xticks(bins, rotation=45)
plt.show()

## Task 4
**Interpret the results.**

The histogram shows that the distribution of property prices is right-skewed (positively skewed): most properties are priced between $200,000 and $300,000, while a few very expensive properties (outliers) stretch the tail to the right.
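
A reading like this can be checked numerically by counting observations per bin instead of eyeballing the bars. The following is a minimal sketch on made-up prices (the `prices` array is hypothetical, not the course data); it reuses the fixed $100,000 bin width from Task 3 and reports the bin that holds the most properties.

```python
import numpy as np

# Hypothetical right-skewed prices (invented values, not the course data).
prices = np.array([195_000, 210_000, 225_000, 240_000, 255_000,
                   270_000, 285_000, 310_000, 360_000, 520_000])

# Same binning idea as in Task 3: fixed width of $100,000 starting at the minimum.
width = 100_000
edges = np.arange(prices.min(), prices.max() + width, width)
counts, edges = np.histogram(prices, bins=edges)

# The modal bin is the one with the highest count.
i = counts.argmax()
share = counts[i] / counts.sum()
print(f"Modal bin: {edges[i]:,.0f} to {edges[i + 1]:,.0f} ({share:.0%} of properties)")

# A long right tail shows up as occupied bins far above the modal bin.
print("Counts per bin:", counts.tolist())
```

Applied to the real data, the same steps would use `df['Price']` in place of `prices`.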

## Task 5
**Create a scatter plot showing the relationship between Price and Area.**

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Area (ft.)', y='Price', data=df, alpha=0.6)
plt.title('Scatter Plot: Price vs Area')
plt.xlabel('Area (ft.)')
plt.ylabel('Price')
plt.show()

**Interpretation:** The scatter plot suggests a strong positive linear relationship between Price and Area: as the area of a property increases, its price tends to increase as well.

## Task 6
**Create a frequency distribution table for Country.**

In [None]:
# Calculate frequencies
country_counts = df['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Frequency']

# Calculate relative frequency
country_counts['Relative Frequency'] = country_counts['Frequency'] / country_counts['Frequency'].sum()

# Calculate cumulative relative frequency (running total of the shares, ending at 1)
country_counts['Cumulative Frequency'] = country_counts['Relative Frequency'].cumsum()

country_counts
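
As a quick sanity check on a table like this, the relative frequencies should sum to 1 and the cumulative column should end at 1. The sketch below shows this on a made-up list of countries (the `countries` values are hypothetical), using `value_counts(normalize=True)` as an alternative way to obtain the shares.

```python
import pandas as pd

# Hypothetical country labels (invented values, not the course data).
countries = pd.Series(['USA', 'USA', 'USA', 'Canada', 'Russia', 'USA', 'Canada'])

# Absolute counts, sorted from most to least frequent.
freq = countries.value_counts()

# normalize=True returns shares instead of counts.
rel = countries.value_counts(normalize=True)

# Running total of the shares: the cumulative relative frequency.
cum = rel.cumsum()

print(pd.DataFrame({'Frequency': freq,
                    'Relative Frequency': rel,
                    'Cumulative Frequency': cum}))
print("Shares sum to:", rel.sum())
```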

## Task 7
**Create a Pareto diagram for Country.**

In [None]:
fig, ax1 = plt.subplots(figsize=(12, 6))

# Bar plot for Frequency
sns.barplot(x='Country', y='Frequency', data=country_counts, ax=ax1, color='steelblue')
ax1.set_ylabel('Frequency')
ax1.set_xlabel('Country')

# Line plot for Cumulative Frequency
ax2 = ax1.twinx()
sns.lineplot(x='Country', y='Cumulative Frequency', data=country_counts, ax=ax2, color='red', marker='o', sort=False)
ax2.set_ylabel('Cumulative relative frequency')
ax2.set_ylim(0, 1.1)

plt.title('Pareto Diagram: Country')
plt.show()

## Task 8
**Calculate Mean, Median, Mode, Skewness, Variance, and Standard Deviation of Price.**

In [None]:
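# pandas uses the sample (n - 1) formulas for var() and std(),
# and a bias-corrected estimator for skew().
price_stats = {
    'Mean': df['Price'].mean(),
    'Median': df['Price'].median(),
    # mode() returns every tied value in ascending order; [0] keeps the smallest
    'Mode': df['Price'].mode()[0],
    'Skewness': df['Price'].skew(),
    'Variance': df['Price'].var(),
    'Standard Deviation': df['Price'].std()
}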

for stat, value in price_stats.items():
    print(f"{stat}: {value:.2f}")

## Task 9
**Interpret the measures.**

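- **Mean vs Median**: The mean is higher than the median, which is consistent with positive (right) skew: the few very expensive properties pull the mean upward but barely affect the median.
- **Skewness**: A positive skewness value (> 0) indicates a longer tail on the right side of the distribution.
- **Standard Deviation**: Measures the typical distance of prices from the mean, in the same units as Price (dollars). The variance is its square and is therefore expressed in squared dollars, which makes it harder to interpret directly.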
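
The points above can be illustrated on a small sample. This is a minimal sketch on made-up prices (the `prices` values are hypothetical, not the course data); it also contrasts the sample standard deviation that pandas computes by default (`ddof=1`) with NumPy's default population version (`ddof=0`), and adds the coefficient of variation as one optional way to express spread relative to the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (invented values, not the course data).
prices = pd.Series([195_000, 210_000, 225_000, 240_000, 255_000,
                    270_000, 285_000, 310_000, 360_000, 520_000], dtype=float)

mean = prices.mean()
median = prices.median()
print(f"Mean: {mean:,.0f}  Median: {median:,.0f}  Mean > Median: {mean > median}")

# pandas divides by n - 1 (sample), NumPy by n (population) unless told otherwise.
sample_std = prices.std()
population_std = np.std(prices.to_numpy())
print(f"Sample std: {sample_std:,.0f}  Population std: {population_std:,.0f}")

# Coefficient of variation: spread relative to the mean (unitless).
cv = sample_std / mean
print(f"Coefficient of variation: {cv:.2f}")

# Positive skewness goes together with the mean exceeding the median here.
print(f"Skewness: {prices.skew():.2f}")
```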

## Task 10
**Calculate Covariance and Correlation between Price and Area.**

In [None]:
covariance = df['Price'].cov(df['Area (ft.)'])
correlation = df['Price'].corr(df['Area (ft.)'])

print(f"Covariance: {covariance:.2f}")
print(f"Correlation: {correlation:.4f}")

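**Interpretation:** The correlation coefficient is close to 1, indicating a very strong positive linear relationship between Price and Area, which is consistent with the scatter plot. The covariance is also positive, but its magnitude depends on the units of both variables, so it is the correlation that conveys the strength of the relationship.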
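
The link between the two measures is that the correlation equals the covariance divided by the product of the two standard deviations. The sketch below checks this on a small made-up frame (the `toy` values are hypothetical) and shows that rescaling Price changes the covariance but not the correlation.

```python
import pandas as pd

# Hypothetical area/price pairs (invented values, not the course data).
toy = pd.DataFrame({
    'Area (ft.)': [600, 750, 900, 1100, 1500],
    'Price': [180_000, 215_000, 260_000, 300_000, 420_000],
})

cov = toy['Price'].cov(toy['Area (ft.)'])

# Correlation = covariance / (std of x * std of y), all with the sample (n - 1) formulas.
corr_manual = cov / (toy['Price'].std() * toy['Area (ft.)'].std())
corr_pandas = toy['Price'].corr(toy['Area (ft.)'])
print(f"Manual: {corr_manual:.4f}  pandas: {corr_pandas:.4f}")

# Express Price in thousands of dollars: the covariance shrinks by a factor
# of 1,000 while the correlation stays the same.
price_thousands = toy['Price'] / 1000
cov_thousands = price_thousands.cov(toy['Area (ft.)'])
corr_thousands = price_thousands.corr(toy['Area (ft.)'])
print(f"Covariance: {cov:,.2f} vs {cov_thousands:,.2f}")
print(f"Correlation: {corr_pandas:.4f} vs {corr_thousands:.4f}")
```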