

---

### **Key Features and Applications**
#### **Features:**
1. **Property Details**:  
   - `Location (Township, Area, State)`
   - `Tenure`  
   - `Type (e.g., Residential, Commercial, etc.)`  

2. **Market Insights**:  
   - `Median_Price`: Median price for properties in the respective category.  
   - `Median_PSF`: Median price per square foot, giving a standardized measure of property value.  
   - `Transactions`: Number of transactions, indicating market activity.

#### **Applications:**
- **Exploratory Data Analysis (EDA)**:
  - Distribution of property prices and transactions across states.
  - Comparing property types (Residential vs. Commercial).
  - Heatmaps for correlations between price, price per square foot, and transactions.

- **Predictive Modeling**:
  - **Regression Models**: Predicting median property prices based on property features.  
  - **Clustering**: Grouping similar states or areas based on property market behavior.  

- **Trend Analysis**:
  - Temporal trends in transactions and prices (if time-series data is included).  
  - Market segmentation based on affordability.
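
As one possible reading of "market segmentation based on affordability", the sketch below bins townships into three equal-sized bands by `Median_PSF` using `pandas.qcut`. The rows and the band names (`Affordable`, `Mid-range`, `Premium`) are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are invented.
toy = pd.DataFrame({
    "Township": ["A", "B", "C", "D", "E", "F"],
    "Median_PSF": [120.0, 180.0, 250.0, 310.0, 420.0, 900.0],
})

# Split into three quantile-based bands by price per square foot.
# The band labels are hypothetical names chosen for this sketch.
toy["Affordability"] = pd.qcut(
    toy["Median_PSF"], q=3, labels=["Affordable", "Mid-range", "Premium"]
)

# Count townships per band, in band order rather than by frequency.
segment_counts = toy["Affordability"].value_counts().sort_index()
print(toy)
print(segment_counts)
```

Quantile bands adapt to the skew in the price data, whereas fixed price thresholds would need to be chosen by hand.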

---

### **EDA Tasks**
Here’s how you can start analyzing the dataset:  
- **State-level Analysis**: 
  - Median prices and transactions across Malaysian states.
  - Distribution of property types in different states.

- **Price Distribution**:
  - Histograms or KDE plots for `Median_Price` and `Median_PSF`.  
  - Box plots by `Type` or `Tenure` for price variability.

- **Correlation Analysis**: 
  - Correlation between transactions, prices, and price per square foot.
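
A minimal sketch of the state-level analysis, assuming the column names `State`, `Median_Price`, and `Transactions`. The five rows are copied from the first rows of the dataset as shown in the notebook's `df.head()` output; the summary column names (`townships`, `typical_price`, `total_transactions`) are hypothetical.

```python
import pandas as pd

# First five rows of the dataset (only the columns needed here).
toy = pd.DataFrame({
    "State": ["Penang", "Johor", "Perak", "Melaka", "Perak"],
    "Median_Price": [331800.0, 590900.0, 229954.0, 255600.0, 219300.0],
    "Transactions": [593, 519, 414, 391, 363],
})

# One row per state: number of townships, median of the township median
# prices, and total transaction count, sorted by market activity.
state_summary = (
    toy.groupby("State")
    .agg(
        townships=("Median_Price", "count"),
        typical_price=("Median_Price", "median"),
        total_transactions=("Transactions", "sum"),
    )
    .sort_values("total_transactions", ascending=False)
)
print(state_summary)
```

On the full dataset the same call would be applied to `df` before any label encoding, so that the state names stay readable.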

---

### **Predictive Modeling Ideas**
You can use the dataset for the following models:  
1. **Price Prediction**:  
   - Predict `Median_Price` using features like `State`, `Tenure`, `Type`, and `Area`.  
   - Use regression models like Linear Regression, Random Forest, or XGBoost.  

2. **Transaction Volume Prediction**:  
   - Predict `Transactions` to understand market activity.  
   - Use time-series forecasting if historical data is available.
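
The price-prediction idea could be sketched as a scikit-learn `Pipeline` that one-hot encodes the categorical features and fits a Random Forest. The data below is synthetic: the column names match the dataset, but the values and the state-to-price relationship are invented for illustration, and `Area` is left out to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the dataset (values are invented).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "State": rng.choice(["Johor", "Perak", "Selangor"], size=n),
    "Tenure": rng.choice(["Freehold", "Leasehold"], size=n),
    "Type": rng.choice(["Terrace House", "Condominium"], size=n),
})
base = toy["State"].map({"Johor": 400_000, "Perak": 250_000, "Selangor": 600_000})
toy["Median_Price"] = base + rng.normal(0, 20_000, size=n)

features = ["State", "Tenure", "Type"]
X = toy[features]
y = toy["Median_Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Encoding lives inside the pipeline, so it is fitted on training rows only.
model = Pipeline([
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), features),
    ])),
    ("regress", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"Test R2 on synthetic data: {r2:.3f}")
```

One-hot encoding avoids implying an order between states or property types, which integer label codes would do for linear and distance-based models.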

---

### **Visualization Suggestions**
- **Barplots**: Comparing states based on transactions or median price.  
- **Heatmaps**: Correlation between numerical variables.  
- **Scatter Plots**: Relationship between `Median_PSF` and `Transactions`.  
- **Pie Charts**: Distribution of property types across the dataset.  
- **Choropleth Maps** (if geographical data is available): Visualize state-level trends.

---

### **Example Workflow**
1. **Preprocess**:
   - Handle missing data and outliers.  
   - Encode categorical variables.  

2. **EDA**:
   - Generate insights through descriptive statistics and visualizations.  

3. **Modeling**:
   - Split data into training and test sets.  
   - Train regression or classification models.  

4. **Evaluation**:
   - Compare model performance using RMSE, R², or MAE.

5. **Visualization**:
   - Summarize findings through dashboards (e.g., Power BI, Tableau, or Matplotlib/Seaborn in Python).
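
For the evaluation step, the three metrics can be worked through by hand on a small example. The prices below are invented; the formulas are the standard definitions of MAE, RMSE, and R².

```python
import numpy as np

# Invented true and predicted median prices (in the dataset's currency units).
y_true = np.array([300_000.0, 450_000.0, 600_000.0, 250_000.0])
y_pred = np.array([320_000.0, 430_000.0, 580_000.0, 290_000.0])

errors = y_pred - y_true

# MAE: average absolute error, in the same units as the target.
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = np.sqrt(np.mean(errors ** 2))

# R2: share of the target's variance explained by the predictions.
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2:   {r2:.4f}")
```

Here the single 40,000 miss pulls RMSE above MAE, which is why the two are often reported together.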

---



In [1]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('/kaggle/input/house-prices-in-malaysia-2025/malaysia_house_price_data_2025.csv')

In [3]:
df.head()

Unnamed: 0,Township,Area,State,Tenure,Type,Median_Price,Median_PSF,Transactions
0,SCIENTEX SUNGAI DUA,Tasek Gelugor,Penang,Freehold,Terrace House,331800.0,304.0,593
1,BANDAR PUTRA,Kulai,Johor,Freehold,"Cluster House, Terrace House",590900.0,322.0,519
2,TAMAN LAGENDA TROPIKA TAPAH,Chenderiang,Perak,Freehold,Terrace House,229954.0,130.0,414
3,SCIENTEX JASIN MUTIARA,Bemban,Melaka,Freehold,Terrace House,255600.0,218.0,391
4,TAMAN LAGENDA AMAN,Tapah,Perak,Leasehold,Terrace House,219300.0,168.0,363


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Township      2000 non-null   object 
 1   Area          2000 non-null   object 
 2   State         2000 non-null   object 
 3   Tenure        2000 non-null   object 
 4   Type          2000 non-null   object 
 5   Median_Price  2000 non-null   float64
 6   Median_PSF    2000 non-null   float64
 7   Transactions  2000 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 125.1+ KB


In [5]:
df.describe()

Unnamed: 0,Median_Price,Median_PSF,Transactions
count,2000.0,2000.0,2000.0
mean,490685.4,328.8625,28.0915
std,468632.2,193.281739,37.702385
min,27049.0,38.0,10.0
25%,269950.0,201.0,12.0
50%,390000.0,293.0,16.0
75%,573500.0,412.0,28.0
max,11420500.0,3017.0,593.0


In [6]:
df.isnull().sum()

Township        0
Area            0
State           0
Tenure          0
Type            0
Median_Price    0
Median_PSF      0
Transactions    0
dtype: int64

In [7]:
df.duplicated().sum()

0

In [8]:
df.shape

(2000, 8)

In [9]:
# Split categorical and numerical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
numerical_columns = df.select_dtypes(include=['number']).columns

# Convert to separate DataFrames if needed
categorical_df = df[categorical_columns]
numerical_df = df[numerical_columns]

# Display results
print("Categorical Columns:")
print(categorical_columns)

print("\nNumerical Columns:")
print(numerical_columns)

# Optional: Display the split DataFrames
print("\nCategorical DataFrame:")
print(categorical_df)

print("\nNumerical DataFrame:")
print(numerical_df)

Categorical Columns:
Index(['Township', 'Area', 'State', 'Tenure', 'Type'], dtype='object')

Numerical Columns:
Index(['Median_Price', 'Median_PSF', 'Transactions'], dtype='object')

Categorical DataFrame:
                         Township           Area     State     Tenure  \
0             SCIENTEX SUNGAI DUA  Tasek Gelugor    Penang   Freehold   
1                    BANDAR PUTRA          Kulai     Johor   Freehold   
2     TAMAN LAGENDA TROPIKA TAPAH    Chenderiang     Perak   Freehold   
3          SCIENTEX JASIN MUTIARA         Bemban    Melaka   Freehold   
4              TAMAN LAGENDA AMAN          Tapah     Perak  Leasehold   
...                           ...            ...       ...        ...   
1995           TAMAN ANDALAS JAYA          Klang  Selangor   Freehold   
1996   TAMAN ANJUNG BERCHAM MEGAH           Ipoh     Perak   Freehold   
1997   TAMAN PUNCAK JELAPANG MAJU           Ipoh     Perak  Leasehold   
1998         TAMAN TIONG UNG SIEW           Sibu   Sarawak  Leas

In [10]:
from sklearn.preprocessing import LabelEncoder

In [11]:
# Identify categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to each categorical column
for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])

# Display the transformed DataFrame
print("DataFrame after Label Encoding:")
print(df)

DataFrame after Label Encoding:
      Township  Area  State  Tenure  Type  Median_Price  Median_PSF  \
0          763   283      8       0    31      331800.0       304.0   
1          139   149      0       0    11      590900.0       322.0   
2         1299    61      9       0    31      229954.0       130.0   
3          760    42      5       0    31      255600.0       218.0   
4         1296   282      9       2    31      219300.0       168.0   
...        ...   ...    ...     ...   ...           ...         ...   
1995       952   124     14       0    31      655000.0       410.0   
1996       960    95      9       0    31      337500.0       224.0   
1997      1497    95      9       2    35      290000.0       195.0   
1998      1739   241     13       2    31      480000.0       272.0   
1999      1132    95      9       2    31      449000.0       261.0   

      Transactions  
0              593  
1              519  
2              414  
3              391  
4         
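
The loop above reuses a single `LabelEncoder`, so after it finishes only the mapping for the last column is still available. If the integer codes need to be mapped back to names later (for plot labels, say), one option is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and keeps every mapping. This is a sketch on a few toy rows, not the notebook's method; `toy`, `cat_cols`, and `encoded` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few rows with the dataset's column names.
toy = pd.DataFrame({
    "State": ["Penang", "Johor", "Perak", "Melaka", "Perak"],
    "Tenure": ["Freehold", "Freehold", "Freehold", "Freehold", "Leasehold"],
})
cat_cols = ["State", "Tenure"]

# Fit one encoder over all categorical columns.
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy[cat_cols]), columns=cat_cols, index=toy.index
)

# categories_ holds one sorted label array per column, in column order.
state_labels = list(encoder.categories_[0])

# The codes can be turned back into the original strings.
decoded = encoder.inverse_transform(encoded[cat_cols])

print(encoded)
print(state_labels)
```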

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df, kde=True, palette="muted")
plt.title("Histogram of All Features")
plt.xlabel("Values")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x="Type", palette="viridis")
plt.title("Count Plot of Type")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x="Type", y="Transactions", palette="coolwarm")
plt.title("Bar Plot of Transactions by Type")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x="Type", y="Median_Price", palette="pastel")
plt.title("Box Plot of Median Price by Type")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.violinplot(data=df, x="Type", y="Median_PSF", palette="magma")
plt.title("Violin Plot of Median PSF by Type")
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
for column in df.columns:
    sns.kdeplot(df[column], fill=True, label=column, alpha=0.7)
plt.title("KDE Plot for All Numerical Features", fontsize=16)
plt.xlabel("Value")
plt.ylabel("Density")
plt.legend(title="Features")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="Area", y="Transactions", hue="Type", palette="deep", size="Median_Price")
plt.title("Scatter Plot of Area vs Transactions")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.lineplot(data=df, x="Area", y="Transactions", marker="o", color="b")
plt.title("Line Plot of Transactions over Area")
plt.show()

In [None]:
df.head()

In [None]:
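# pairplot creates its own figure, so a separate plt.figure() call would only
# leave an empty figure behind; palette has no effect without a hue variable
sns.pairplot(data=df, diag_kind="kde", corner=True)
plt.suptitle("Pair Plot with KDE for Numerical Features", y=1.02, fontsize=16)
plt.show()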

In [None]:
type_counts = df['Type'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', colors=sns.color_palette("pastel"))
plt.title("Pie Chart of Type Distribution")
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Heatmap of Correlation Matrix")
plt.show()
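
Because the categorical columns were label-encoded earlier, `df.corr()` also correlates arbitrary integer codes (for example the alphabetical position of a state name), and those coefficients are hard to interpret. A hedged alternative is to restrict the matrix to the three measured columns. The rows below come from the encoded output shown earlier; `measured_cols` is a hypothetical name.

```python
import pandas as pd

# First five rows after label encoding (State holds integer codes).
toy = pd.DataFrame({
    "State": [8, 0, 9, 5, 9],
    "Median_Price": [331800.0, 590900.0, 229954.0, 255600.0, 219300.0],
    "Median_PSF": [304.0, 322.0, 130.0, 218.0, 168.0],
    "Transactions": [593, 519, 414, 391, 363],
})

# Correlate only the columns that were numeric in the raw data.
measured_cols = ["Median_Price", "Median_PSF", "Transactions"]
corr_matrix = toy[measured_cols].corr()
print(corr_matrix.round(2))
```

The resulting `corr_matrix` can be passed to `sns.heatmap` exactly as in the cell above.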

In [None]:
from sklearn.preprocessing import MinMaxScaler


In [None]:
# Identify numerical columns
numerical_columns = df.select_dtypes(include=['number']).columns

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Apply Min-Max Scaling to numerical columns
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Display the transformed DataFrame
print("DataFrame after Min-Max Scaling:")
print(df)
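
Fitting the scaler on the full DataFrame lets the minima and maxima of rows that later end up in the test set influence the training features, and it also rescales the target. A common alternative, sketched here on invented numbers, is to split first and fit `MinMaxScaler` on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# A single invented feature column with one extreme value.
X = np.array([[10.0], [12.0], [16.0], [28.0], [593.0], [20.0], [14.0], [11.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
# Learn the min and max from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training min and max for the test rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Scaled test values can fall outside [0, 1] when a test row lies beyond the training range; that is expected and reflects what the model would see on new data.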

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

In [None]:
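# Step 2: Define features (X) and target (y)
# Median_Price is the target, so it must not remain among the features;
# leaving it in X would let every model read the answer directly
X = df.drop(columns=['Median_Price', 'Median_PSF', 'Transactions'])
y = df['Median_Price']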


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Step 5: Define regression models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42, n_estimators=100),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5),
    "Support Vector Regressor": SVR(),
    "XGBoost": XGBRegressor(random_state=42, verbosity=0)
}

In [None]:
# Step 6: Train, predict, and evaluate each model
results = []
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Predict on the held-out test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store the results
    results.append({"Model": name, "MSE": mse, "MAE": mae, "R2": r2})
    print(f"{name} Evaluation:")
    print(f"Mean Squared Error: {mse}")
    print(f"Mean Absolute Error: {mae}")
    print(f"R2 Score: {r2}")
    print("-" * 30)

In [None]:
# Step 7: Convert results to a DataFrame for easy comparison
results_df = pd.DataFrame(results)

# Step 8: Display the results
print("\nComparison of Regression Models:")
print(results_df.sort_values(by="R2", ascending=False))
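
`cross_val_score` is imported above but not used. A single 80/20 split can flatter or penalize a model by chance, so k-fold cross-validation is a common complement. The sketch below runs 5-fold cross-validation for one model on synthetic data; the data, the coefficients, and the choice of `Ridge` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: three features, a linear signal, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle before splitting so each fold is a random sample of the rows.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print(f"Mean R2: {scores.mean():.3f} (std {scores.std():.3f})")
```

In the notebook the same call could be made inside the model loop with `X` and `y`, reporting the mean and spread per model instead of a single test score.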

In [None]:
results_melted = results_df.melt(id_vars=["Model"], value_vars=["MSE", "MAE", "R2"],
                                 var_name="Metric", value_name="Value")

In [None]:
# Step 2: Set plot style
sns.set_theme(style="whitegrid")

# Step 3: Create a bar plot for comparison
plt.figure(figsize=(12, 8))
sns.barplot(data=results_melted, x="Model", y="Value", hue="Metric", palette="viridis")

# Step 4: Add title, labels, and legend (kept in the same cell as the plot
# so they apply to this figure rather than to a new empty one)
plt.title("Comparison of Regression Models Based on Test Performance Metrics", fontsize=16)
plt.xlabel("Regression Model", fontsize=14)
plt.ylabel("Performance Metric Value", fontsize=14)
plt.xticks(rotation=45, ha="right", fontsize=12)
plt.legend(title="Metric", fontsize=12)
plt.tight_layout()

# Step 5: Display the plot
plt.show()


---

## 🧑🏻‍💻 About the Author  
**Name:** Arif Mia  

🎓 **Profession:** Machine Learning Engineer & Data Scientist  

---

### 🔭 **Career Objective**  
🚀 My goal is to contribute to groundbreaking advancements in artificial intelligence and data science, empowering companies and individuals with data-driven solutions. I strive to simplify complex challenges, craft innovative projects, and pave the way for a smarter and more connected future.  

🔍 As a **Machine Learning Engineer** and **Data Scientist**, I am passionate about using machine learning, deep learning, computer vision, and advanced analytics to solve real-world problems. My expertise lies in delivering impactful solutions by leveraging cutting-edge technologies.  

---

### 💻 **Skills**  
- 🤖 **Artificial Intelligence & Machine Learning**  
- 👁️‍🗨️ **Computer Vision & Predictive Analytics**  
- 🧠 **Deep Learning & Natural Language Processing (NLP)**  
- 🐍 **Python Programming & Automation**  
- 📊 **Data Visualization & Analysis**  
- 🚀 **End-to-End Model Development & Deployment**  

---

### 🚧 **Featured Projects**  

📊 **Lung Cancer Prediction with Deep Learning**  
Achieved 99% accuracy in a computer vision project using 12,000 medical images across three classes. This project involved data preprocessing, visualization, and model training to detect cancer effectively.  

🌾 **Ghana Crop Disease Detection Challenge**  
Developed a model using annotated images to identify crop diseases with bounding boxes, addressing real-world agricultural challenges and disease mitigation.  

🛡️ **Global Plastic Waste Analysis**  
Utilized GeoPandas, Matplotlib, and machine learning models like RandomForestClassifier and CatBoostClassifier to analyze trends in plastic waste management.  

🎵 **Twitter Emotion Classification**  
Performed exploratory data analysis and built a hybrid machine learning model to classify Twitter sentiments, leveraging text data preprocessing and visualization techniques.  

---

### ⚙️ **Technical Skills**  

- 💻 **Programming Languages:** Python 🐍, SQL 🗃️, R 📈  
- 📊 **Data Visualization Tools:** Matplotlib 📉, Seaborn 🌊, Tableau 📊, Power BI 📊  
- 🧠 **Machine Learning & Deep Learning:** Scikit-learn 🤖, TensorFlow 🔥, PyTorch 🧩  
- 🗂️ **Big Data Technologies:** Hadoop 🏗️, Spark ⚡  
- 🚀 **Model Deployment:** Flask 🌐, FastAPI ⚡, Docker 🐳  

---

### 🌐 **Connect with Me**  

📧 **Email:** arifmiahcse@gmail.com 

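🔗 **LinkedIn:** [www.linkedin.com/in/arif-miah-8751bb217](https://www.linkedin.com/in/arif-miah-8751bb217)  

🐱 **GitHub:** [https://github.com/Arif-miad](https://github.com/Arif-miad)  

📝 **Kaggle:** [https://www.kaggle.com/arifmia](https://www.kaggle.com/arifmia)  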

🚀 Let’s turn ideas into reality! If you’re looking for innovative solutions or need collaboration on exciting projects, feel free to reach out.  

---
