Apriori and FP-growth
Apriori and FP-growth are commonly used for market basket analysis, but they can be adapted to analyze real estate market trends. You can use these algorithms to discover associations between different property features or characteristics, such as the number of bedrooms, bathrooms, and location (Place).
Note: running this cell produced an error, but the code is included.


In [None]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# Load the real estate dataset
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Select relevant columns for analysis
property_data = df[['Beds', 'Bath', 'Place']]

# Convert categorical variables into numerical format for Apriori and FP-growth
property_data_encoded = pd.get_dummies(property_data, columns=['Beds', 'Bath', 'Place'])

# Apriori algorithm to find frequent itemsets
frequent_itemsets_apriori = apriori(property_data_encoded, min_support=0.1, use_colnames=True)

# FP-growth algorithm to find frequent itemsets
frequent_itemsets_fpgrowth = fpgrowth(property_data_encoded, min_support=0.1, use_colnames=True)

# Display the frequent itemsets
print("Frequent Itemsets using Apriori:")
print(frequent_itemsets_apriori)

print("\nFrequent Itemsets using FP-growth:")
print(frequent_itemsets_fpgrowth)

# Generate association rules using Apriori
rules_apriori = association_rules(frequent_itemsets_apriori, metric='lift', min_threshold=1.0)

# Generate association rules using FP-growth
rules_fpgrowth = association_rules(frequent_itemsets_fpgrowth, metric='lift', min_threshold=1.0)

# Display the generated association rules
print("\nAssociation Rules using Apriori:")
print(rules_apriori)

print("\nAssociation Rules using FP-growth:")
print(rules_fpgrowth)
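
To make the rule metrics concrete, here is a minimal sketch that computes support, confidence, and lift by hand on a small made-up set of listings. The item names (Beds_3, Place_A, and so on) and the helper functions are illustrative assumptions; they are not values from the dataset and not part of mlxtend.

```python
# Toy listings (hypothetical): each set holds the one-hot "items" of one property.
listings = [
    {"Beds_3", "Bath_2", "Place_A"},
    {"Beds_3", "Bath_2", "Place_B"},
    {"Beds_2", "Bath_1", "Place_A"},
    {"Beds_3", "Bath_2", "Place_A"},
    {"Beds_2", "Bath_2", "Place_B"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_both = support(set(antecedent) | set(consequent), transactions)
    confidence = s_both / s_a
    lift = confidence / s_c
    return s_both, confidence, lift


sup, conf, lift = rule_metrics({"Beds_3"}, {"Bath_2"}, listings)
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests that, in this toy data, three-bedroom listings have two bathrooms more often than chance alone would predict.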

Chi-square Test
You can use the Chi-square test to analyze the independence or association between categorical variables, such as property location (Place) and property type (e.g., description). This can help you understand if certain property types are more common in specific areas.


In [None]:
# Import necessary libraries
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Create a contingency table (cross-tabulation) for the Chi-square test
contingency_table = pd.crosstab(df['Place'], df['Description'])

# Print the contingency table
print("Contingency Table:")
print(contingency_table)

# Perform the Chi-square test
chi2, p, _, _ = chi2_contingency(contingency_table)

# Print the results
print("\nChi-square Statistic:", chi2)
print("P-value:", p)

# Interpret the results
alpha = 0.05
print("\nSignificance level (alpha):", alpha)
print("Conclusion:")
if p < alpha:
    print("Reject the null hypothesis. There is a significant association between property location and property type.")
else:
    print("Fail to reject the null hypothesis. No significant association between property location and property type was detected.")
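
The statistic reported by the test can be reproduced by hand, which may help when interpreting it. The sketch below uses a made-up 2x3 contingency table (the counts are assumptions, not taken from the dataset) and computes the expected counts under independence, the Chi-square statistic, and the degrees of freedom.

```python
# Hypothetical contingency table: rows are places, columns are property types.
observed = [
    [20, 30, 10],
    [30, 20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell, where
# expected = row total * column total / grand total (independence assumption).
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print("Chi-square statistic:", chi2_stat)
print("Degrees of freedom:", dof)
```

If Description has many distinct values, many expected counts will be small, and the Chi-square approximation becomes less reliable; a common rule of thumb is to aim for expected counts of at least 5 per cell.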

Data Similarity and Dissimilarity
You can measure data similarity and dissimilarity between properties using techniques like cosine similarity. This can help identify similar properties in the dataset based on their feature vectors, providing insights into property trends.


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Select the columns used for similarity analysis (copy to avoid SettingWithCopyWarning)
similarity_data = df[['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description']].copy()

# Convert categorical variables into numerical format for similarity analysis
label_encoder = LabelEncoder()
similarity_data['Place'] = label_encoder.fit_transform(similarity_data['Place'])
similarity_data['Description'] = label_encoder.fit_transform(similarity_data['Description'])

# Standardize the features for similarity analysis
scaler = StandardScaler()
similarity_data_scaled = scaler.fit_transform(similarity_data[['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description']])

# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(similarity_data_scaled, similarity_data_scaled)

# Display the cosine similarity matrix
print("Cosine Similarity Matrix:")
print(cosine_sim_matrix)

# Example: Find similar properties to a given property (e.g., first property in the dataset)
property_index = 0
similar_properties = cosine_sim_matrix[property_index]

# Display the most similar properties
print("\nMost Similar Properties to Property at Index", property_index)
similar_properties_indices = sorted(range(len(similar_properties)), key=lambda i: similar_properties[i], reverse=True)[1:6]
for index in similar_properties_indices:
    print(f"Property at Index {index}: Cosine Similarity = {similar_properties[index]}")
    print(df.loc[index, ['Address', 'Price', 'Beds', 'Bath', 'Sq.Ft', 'Place', 'Description']])
    print("\n")
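
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following minimal sketch computes it by hand for three made-up feature vectors (the values are assumptions chosen so the results are easy to check).

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical standardized feature vectors for three properties
home_a = [1.0, 2.0, 2.0]
home_b = [2.0, 4.0, 4.0]   # same direction as home_a, twice the magnitude
home_c = [2.0, -1.0, 0.0]  # orthogonal to home_a

sim_ab = cosine(home_a, home_b)
sim_ac = cosine(home_a, home_c)
print("a vs b:", sim_ab)
print("a vs c:", sim_ac)
```

Because cosine similarity ignores magnitude, home_a and home_b count as identical here even though one vector is twice as long; if magnitude matters for your comparison, a distance measure such as Euclidean distance may be a better fit.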

Frequent Itemsets and Compact Representation
Mining frequent itemsets can reveal common combinations of property features. You can identify common property feature patterns and compactly represent these patterns to gain insights into which combinations are prevalent in the market.


In [None]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.preprocessing import LabelEncoder

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Select the columns used for frequent itemset mining (copy to avoid SettingWithCopyWarning)
itemset_data = df[['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description']].copy()

# Convert categorical variables into numerical format for frequent itemset mining
label_encoder = LabelEncoder()
itemset_data['Place'] = label_encoder.fit_transform(itemset_data['Place'])
itemset_data['Description'] = label_encoder.fit_transform(itemset_data['Description'])

# Convert the dataset to a one-hot encoded format
one_hot_encoded = pd.get_dummies(itemset_data, columns=['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description'])

# Apriori to find frequent itemsets
frequent_itemsets = apriori(one_hot_encoded, min_support=0.1, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)

# Display the generated frequent itemsets
print("Generated Frequent Itemsets:")
print(frequent_itemsets)

# Compact view: keep only the key columns of each association rule
compact_representation = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

# Display the compact view
print("\nCompact View of Association Rules:")
print(compact_representation)
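
In the data mining literature, a compact representation of frequent itemsets usually means the closed itemsets (no proper superset has the same support) and the maximal itemsets (no proper superset is frequent at all). Assuming that is the sense intended here, the sketch below derives both from a small made-up table of frequent itemsets; the item names, support counts, and helper functions are illustrative assumptions.

```python
# Hypothetical frequent itemsets mapped to their support counts.
frequent = {
    frozenset({"Beds_3"}): 3,
    frozenset({"Bath_2"}): 4,
    frozenset({"Beds_3", "Bath_2"}): 3,
}


def closed_itemsets(itemsets):
    """Itemsets with no proper superset that has the same support count."""
    return {
        s
        for s, count in itemsets.items()
        if not any(s < t and itemsets[t] == count for t in itemsets)
    }


def maximal_itemsets(itemsets):
    """Itemsets with no frequent proper superset."""
    return {s for s in itemsets if not any(s < t for t in itemsets)}


closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
print("Closed:", [sorted(s) for s in closed])
print("Maximal:", [sorted(s) for s in maximal])
```

Closed itemsets preserve every support value, so nothing is lost; maximal itemsets are more compact still but only record which itemsets are frequent, not how frequent.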

K-means
K-means clustering can help you group properties based on their features. For example, you can cluster properties into different segments based on factors like the number of bedrooms, bathrooms, square footage, and price. This can reveal patterns in the market and identify different market segments.


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset
file_path = "datasets/Homes for Sale and Real Estate.csv"
df = pd.read_csv(file_path)

# Select features for clustering
features = df[['Price', 'Beds', 'Bath', 'Sq.Ft']]

# Standardize the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Determine the number of clusters (you can adjust this based on your requirements)
num_clusters = 3

# Apply K-means clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(features_scaled)

# Display the clustered properties
print("Clustered Properties:")
print(df[['Address', 'Price', 'Beds', 'Bath', 'Sq.Ft', 'Cluster']])

# Visualize the clusters (for 2D visualization)
plt.scatter(features_scaled[:, 0], features_scaled[:, 1], c=df['Cluster'], cmap='viridis')
plt.xlabel('Standardized Price')
plt.ylabel('Standardized Beds')
plt.title('K-means Clustering of Properties')
plt.show()
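
To show what KMeans does internally, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) on six made-up 2D points. The points, the fixed starting centroids, and the function name are assumptions for illustration; scikit-learn's implementation adds smarter initialization and other refinements.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then update."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical standardized (price, beds) pairs forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print("Labels:", labels)
print("Centroids:", centroids)
```

The result depends on the starting centroids, which is why the notebook cell fixes random_state; trying several values of num_clusters and comparing the within-cluster distances is a common way to choose the number of clusters.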

KNN (Nearest Neighbor Classifiers)
K-nearest neighbor (KNN) models can predict housing prices (regression) or property types (classification) from similar properties. Given a property's features, you find the k most similar properties in the dataset and use the average of their prices, or the most common of their types, as the prediction for the target property.


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, mean_squared_error

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Select the columns used for KNN (copy to avoid SettingWithCopyWarning)
knn_data = df[['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description']].copy()

# Convert categorical variables into numerical format for KNN
label_encoder = LabelEncoder()
knn_data['Place'] = label_encoder.fit_transform(knn_data['Place'])
knn_data['Description'] = label_encoder.fit_transform(knn_data['Description'])

# Separate features and target variable
X = knn_data[['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description']]
y_price = df['Price']
y_type = df['Description']

# Split the dataset into training and testing sets
X_train, X_test, y_price_train, y_price_test, y_type_train, y_type_test = train_test_split(
    X, y_price, y_type, test_size=0.2, random_state=42
)

# KNN for predicting housing prices
knn_price = KNeighborsRegressor(n_neighbors=5)
knn_price.fit(X_train, y_price_train)
y_price_pred = knn_price.predict(X_test)

# Evaluate the KNN model for housing prices
mse_price = mean_squared_error(y_price_test, y_price_pred)
print("\nKNN Model for Housing Prices:")
print("Mean Squared Error:", mse_price)

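# KNN for predicting property types
# Exclude the encoded 'Description' column from the features: it is the target here,
# so keeping it would leak the answer into the model
type_features = ['Beds', 'Bath', 'Sq.Ft', 'Place']
knn_type = KNeighborsClassifier(n_neighbors=5)
knn_type.fit(X_train[type_features], y_type_train)
y_type_pred = knn_type.predict(X_test[type_features])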

# Evaluate the KNN model for property types
accuracy_type = accuracy_score(y_type_test, y_type_pred)
print("\nKNN Model for Property Types:")
print("Accuracy Score:", accuracy_type)
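
The prediction step itself is simple enough to write out by hand. The sketch below, on five made-up training rows, finds the k nearest neighbors of a query property and returns the mean of their prices and the majority vote of their types. The data and the knn_predict helper are assumptions for illustration, not the scikit-learn implementation.

```python
import math
from collections import Counter

# Hypothetical training rows: ((beds, baths), price, property_type)
train = [
    ((2, 1), 200_000, "Condo"),
    ((3, 2), 300_000, "House"),
    ((3, 2), 320_000, "House"),
    ((5, 3), 500_000, "House"),
    ((1, 1), 150_000, "Condo"),
]


def knn_predict(query, rows, k=3):
    """Return (mean price, majority type) of the k rows closest to query."""
    nearest = sorted(rows, key=lambda row: math.dist(query, row[0]))[:k]
    price = sum(row[1] for row in nearest) / k
    ptype = Counter(row[2] for row in nearest).most_common(1)[0][0]
    return price, ptype


pred_price, pred_type = knn_predict((3, 2), train)
print(f"Predicted price: {pred_price:.2f}")
print("Predicted type:", pred_type)
```

Because KNN relies on raw distances, features on a large scale (such as square footage) can dominate features on a small scale (such as bedroom count); standardizing the features first, as in the K-means cell, is a common remedy.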

Data Warehouse & OLAP
You can create a data warehouse and use Online Analytical Processing (OLAP) techniques to perform multidimensional analysis of the data. This can help you drill down into different dimensions (e.g., location, property type, and price range) to gain insights into the real estate market from different angles.
Note: the second dataset has not been integrated into this analysis yet.


In [None]:
# Import necessary libraries
import pandas as pd
import sqlite3
from sqlalchemy import create_engine

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Create a SQLite database and engine
db_path = "real_estate_database.db"  # Replace with the desired path for the database file
engine = create_engine(f"sqlite:///{db_path}")

# Save the DataFrame to the database
df.to_sql('real_estate', engine, index=False, if_exists='replace')

# OLAP Analysis Example: Average Price by Location and Property Type
query = """
    SELECT Place, Description, AVG(Price) as AvgPrice
    FROM real_estate
    GROUP BY Place, Description
"""

# Execute the query and fetch results
result_df = pd.read_sql(query, engine)

# Display the OLAP results
print("OLAP Analysis - Average Price by Location and Property Type:")
print(result_df)
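
Drilling down and rolling up correspond to grouping by more or fewer dimensions. The following sketch shows both on a small in-memory SQLite table; the five rows are made up, and the table reuses the real_estate name and the Place, Description, and Price columns only for illustration.

```python
import sqlite3

# In-memory database with a few hypothetical listings
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE real_estate (Place TEXT, Description TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO real_estate VALUES (?, ?, ?)",
    [
        ("A", "House", 300000),
        ("A", "House", 500000),
        ("A", "Condo", 200000),
        ("B", "Condo", 100000),
        ("B", "Condo", 300000),
    ],
)

# Roll-up: aggregate over property type, leaving one row per place
roll_up = conn.execute(
    "SELECT Place, AVG(Price) FROM real_estate GROUP BY Place ORDER BY Place"
).fetchall()

# Drill-down: add the property type dimension back in
drill_down = conn.execute(
    "SELECT Place, Description, AVG(Price) FROM real_estate "
    "GROUP BY Place, Description ORDER BY Place, Description"
).fetchall()
conn.close()

print("Roll-up by place:", roll_up)
print("Drill-down by place and type:", drill_down)
```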

Rule Generation
You can generate association rules to discover patterns and relationships between property features. For example, you can find rules like "In area X, properties with more bedrooms tend to have higher prices."


In [None]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.preprocessing import LabelEncoder

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Select the columns used for rule generation (copy to avoid SettingWithCopyWarning)
rule_data = df[['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description', 'Price']].copy()

# Convert categorical variables into numerical format for rule generation
label_encoder = LabelEncoder()
rule_data['Place'] = label_encoder.fit_transform(rule_data['Place'])
rule_data['Description'] = label_encoder.fit_transform(rule_data['Description'])

# Bin the continuous variable (Price) to create categorical labels
bins = [0, 500000, 1000000, float('inf')]
labels = ['Low', 'Medium', 'High']
rule_data['Price_Category'] = pd.cut(rule_data['Price'], bins=bins, labels=labels, right=False)

# Drop the original Price column
rule_data = rule_data.drop('Price', axis=1)

# Convert the dataset to a one-hot encoded format
one_hot_encoded = pd.get_dummies(rule_data, columns=['Beds', 'Bath', 'Sq.Ft', 'Place', 'Description', 'Price_Category'])

# Apriori to find frequent itemsets
frequent_itemsets = apriori(one_hot_encoded, min_support=0.1, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)

# Display the generated association rules
print("Generated Association Rules:")
print(rules)
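
The price binning step decides which rules can appear, so it is worth checking how boundary values are assigned. With right=False, pd.cut uses left-closed intervals: [0, 500000), [500000, 1000000), and [1000000, inf). The sketch below mirrors that behavior with the standard library; the price_category helper is an assumption for illustration.

```python
from bisect import bisect_right

# Interior bin edges, matching bins = [0, 500000, 1000000, inf] in the cell above
edges = [500_000, 1_000_000]
labels = ["Low", "Medium", "High"]


def price_category(price):
    """Map a non-negative price to its left-closed bin label."""
    # bisect_right counts how many edges are <= price, which is the bin index.
    # Unlike pd.cut, this helper does not flag negative prices as missing.
    return labels[bisect_right(edges, price)]


for price in (499_999, 500_000, 999_999, 1_000_000):
    print(price, "->", price_category(price))
```

A price of exactly 500000 therefore lands in Medium, not Low.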

Simpson's Paradox
Simpson's paradox occurs when a trend that appears in aggregated data weakens, disappears, or reverses once the data are split into groups. You can check for it by comparing how price trends change when properties are grouped or split by specific characteristics, such as location or property type.
Note: running this cell produced an error, but the code is included.


In [None]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Create a simplified dataset for illustration
# You may need to adapt this to your actual dataset and analysis
simplified_df = df[['Place', 'Beds', 'Price']]

# Group by 'Place' and calculate average price
average_price_by_place = simplified_df.groupby('Place')['Price'].mean().reset_index()

# Plot the overall average price
plt.figure(figsize=(10, 5))
sns.barplot(x='Place', y='Price', data=average_price_by_place, color='lightblue')
plt.title('Overall Average Price by Place')
plt.show()

# Check for Simpson's paradox by splitting each place by a potential confounder ('Beds')
plt.figure(figsize=(12, 6))
sns.barplot(x='Place', y='Price', hue='Beds', data=simplified_df, palette='viridis')
plt.title('Average Price by Place and Number of Bedrooms')
plt.show()
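
A small numeric example may make the paradox easier to recognize in the plots. In the made-up listings below (places, bedroom counts, and prices in thousands are all assumptions), place A is more expensive than place B within every bedroom group, yet B has the higher overall average because most of its listings are large homes.

```python
from collections import defaultdict

# Hypothetical listings: (place, beds, price in thousands)
listings = [
    ("A", 2, 200), ("A", 2, 200), ("A", 2, 200), ("A", 4, 500),
    ("B", 2, 180), ("B", 4, 480), ("B", 4, 480), ("B", 4, 480),
]


def mean_by(rows, key):
    """Average price per group, where key(row) picks the group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {group: sum(prices) / len(prices) for group, prices in groups.items()}


overall = mean_by(listings, key=lambda row: row[0])
by_beds = mean_by(listings, key=lambda row: (row[0], row[1]))

print("Overall average by place:", overall)
print("Average by place and beds:", by_beds)
```

The aggregated comparison and the per-group comparison point in opposite directions, which is the pattern to look for when comparing the two bar charts.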

TF-IDF
You can use TF-IDF to analyze textual data, such as property descriptions. This can help identify keywords or terms that are more prevalent in certain property descriptions and gain insights into market trends based on textual data.


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset from a CSV file
file_path = "datasets/Homes for Sale and Real Estate.csv"  # Replace with the actual path to your CSV file
df = pd.read_csv(file_path)

# Extract property descriptions
descriptions = df['Description'].fillna('')  # Missing descriptions become empty strings

# Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

# Fit and transform the property descriptions
tfidf_matrix = tfidf_vectorizer.fit_transform(descriptions)

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Display the TF-IDF matrix
print("TF-IDF Matrix:")
print(tfidf_df)

# Example: Display the top keywords for each property description
for i, description in enumerate(descriptions):
    print(f"\nTop keywords for Property {i + 1}: {description}")
    # The index of the sorted row already holds the feature names (column labels)
    top_keywords = tfidf_df.iloc[i].sort_values(ascending=False).head(5).index
    print(list(top_keywords))
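
To see why some keywords score higher than others, the sketch below computes the textbook form of TF-IDF (term frequency times log(N / document frequency)) by hand on three made-up descriptions. The documents and helper are assumptions for illustration. Note that TfidfVectorizer, with its default settings, uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these even though the ranking idea is the same.

```python
import math
from collections import Counter

# Hypothetical property descriptions
docs = ["pool garage garden", "garage garden", "garden"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of descriptions that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: (count / length) * log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = [tfidf(tokens) for tokens in tokenized]
top_term_doc0 = max(scores[0], key=scores[0].get)
print("Scores for first description:", scores[0])
print("Top keyword for first description:", top_term_doc0)
```

The term "garden" appears in every description, so its IDF is zero and it carries no distinguishing weight, while the rarer "pool" ranks highest.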