
*This Python notebook demonstrates feature reduction in machine learning on a real-world real-estate dataset. Feature reduction is a common step in machine learning pipelines: by keeping only the features that matter most for prediction, it can improve model performance, reduce overfitting, and make the model easier to interpret.*

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error




*   The code starts by loading the dataset `SA_Aqar.csv`, assumed to contain real estate property information, with pandas.
*   The categorical columns (`city`, `district`, `front`) are converted to numeric format with one-hot encoding so that machine learning algorithms can use them.
*   The target variable `price` is separated from the feature matrix, and the irrelevant column `details` is dropped.
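As a small illustration of the encoding and separation steps, the sketch below applies `pd.get_dummies` to a made-up three-row frame. The column values and the names `toy`, `X_toy`, and `y_toy` are assumptions for demonstration and do not come from `SA_Aqar.csv`.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from SA_Aqar.csv
toy = pd.DataFrame({
    "city": ["A", "B", "A"],
    "size": [250, 370, 400],
    "price": [80000, 90000, 120000],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, columns=["city"])

# Separate the target from the feature matrix
X_toy = toy_encoded.drop(columns=["price"])
y_toy = toy["price"]

print(X_toy.columns.tolist())
```

Each distinct category becomes its own indicator column, which is why a column with many distinct values, such as `district`, can expand the feature matrix considerably.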





In [None]:
# Load the dataset
df = pd.read_csv('/content/SA_Aqar.csv')

# Convert 'city', 'district', and 'front' columns to numeric format using get_dummies
df_encoded = pd.get_dummies(df, columns=['city', 'district', 'front'])

# Assuming 'price' is the target variable, and excluding 'details' column as well
X = df_encoded.drop(['price', 'details'], axis=1)
y = df['price']

print(df)

         city         district front  size  property_age  bedrooms  bathrooms  \
0      الرياض       حي العارض   شمال   250             0         5          5   
1      الرياض     حي القادسية   جنوب   370             0         4          5   
2      الرياض     حي القادسية   جنوب   380             0         4          5   
3      الرياض     حي المعيزلة    غرب   250             0         5          5   
4      الرياض       حي العليا    غرب   400            11         7          5   
...       ...              ...   ...   ...           ...       ...        ...   
3713    الخبر       حي اللؤلؤ    غرب   437             0         7          5   
3714    الخبر      حي الصواري   جنوب   400             0         5          5   
3715    الخبر       حي اللؤلؤ    غرب   330             0         6          4   
3716    الخبر     حي الكورنيش   جنوب   300            13         6          5   
3717    الخبر      حي الامواج    غرب   437             0         7          5   

      livingrooms  kitchen 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Features are standardized with scikit-learn's `StandardScaler`. The scaler is fit on the training set only, so each training feature ends up with a mean of 0 and a standard deviation of 1, and the same training statistics are then applied to the test set. This step matters for algorithms that are sensitive to feature scale, such as PCA; tree-based models like random forests are largely unaffected by it.
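The effect of standardization can be checked on synthetic data. The sketch below is a minimal example with made-up values (the arrays `X_train_toy` and `X_test_toy` are assumptions, not the notebook's data); it fits the scaler on the training split and reuses it on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
X_train_toy = np.column_stack([rng.normal(300, 50, 100), rng.normal(3, 1, 100)])
X_test_toy = np.column_stack([rng.normal(300, 50, 20), rng.normal(3, 1, 20)])

scaler_toy = StandardScaler()
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from the training split
X_test_toy_scaled = scaler_toy.transform(X_test_toy)        # reuses the training statistics

# Training columns now have mean ~0 and standard deviation ~1
print(X_train_toy_scaled.mean(axis=0).round(6))
print(X_train_toy_scaled.std(axis=0).round(6))
```

The test split is only approximately centered, because it is transformed with statistics learned from the training split.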

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the training feature data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing feature data
X_test_scaled = scaler.transform(X_test)



*   A `RandomForestRegressor` with 100 estimators is initialized and trained on the standardized training features (`X_train_scaled`).
*   Feature importances are then read from the trained model. Each score indicates the relative contribution of a feature to the model's predictions, and the scores sum to 1.
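To see how these scores behave, the following sketch trains a forest on synthetic data in which only one column drives the target. The data and the names `X_demo`, `y_demo`, and `rf_demo` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: the target depends on "signal" only
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_1": rng.normal(size=500),
    "noise_2": rng.normal(size=500),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=500)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Pair each importance score with its column name and rank them
importances_demo = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances_demo)
```

The informative column should take almost all of the importance, while the noise columns receive small, non-zero scores.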

In [None]:
# Initialize the RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf.fit(X_train_scaled, y_train)

# Get feature importances
importances = rf.feature_importances_

# Create a Series of feature importances with their corresponding column names
feature_importances = pd.Series(importances, index=X.columns).sort_values(ascending=False)

print("Feature Importances:\n", feature_importances)

Feature Importances:
 size                                        0.248109
driver_room                                 0.075515
property_age                                0.056658
district_   حي الشاطئ                       0.039681
district_   حي الحمدانية                    0.037515
                                              ...   
district_   حي السفارات                     0.000000
district_   حي السامر                       0.000000
district_   حي المدينة الصناعية الجديدة     0.000000
district_   حي المزروعية                    0.000000
district_   حي شبرا                         0.000000
Length: 207, dtype: float64




*   A threshold is chosen for feature importance. Features with importance scores below this threshold are considered unimportant and filtered out.
*   The training and test sets are filtered to retain only the important features, and the reduced dataset is saved for further analysis.
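The thresholding step can be traced by hand with the top scores from the ranking printed above. In this sketch the first three entries are copied from that output, `district_x` is a stand-in label for the fourth-ranked district column, and 0.05 is the threshold used in the next cell.

```python
import pandas as pd

# Top four scores from the printed ranking; "district_x" is a placeholder label
scores = pd.Series({
    "size": 0.248109,
    "driver_room": 0.075515,
    "property_age": 0.056658,
    "district_x": 0.039681,
})

threshold_demo = 0.05
kept = scores[scores >= threshold_demo].index.tolist()
kept_share = scores[kept].sum()

print(kept)
print(f"Share of total importance retained: {kept_share:.3f}")
```

Only three features clear the threshold, and together they carry roughly 38% of the total importance, so a fixed cut-off this high discards a substantial part of what the forest used.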



In [None]:
# Choose a threshold for feature importance
threshold = 0.05

# Keep only the features whose importance meets the threshold
important_features = feature_importances[feature_importances >= threshold].index.tolist()

# Filter the original (unscaled) train and test sets to keep only important features
X_train_reduced = X_train[important_features]
X_test_reduced = X_test[important_features]

# Stack the reduced train and test rows into a single frame with a fresh index
concatenated_df = pd.concat([X_train_reduced, X_test_reduced], ignore_index=True)


In [None]:
# Save the reduced feature datasets to Excel
# X_train_reduced.to_excel('X_train_reduced.xlsx', index=False)
# X_test_reduced.to_excel('X_test_reduced.xlsx', index=False)
concatenated_df.to_excel('concatenated_df.xlsx', index=False)
print(concatenated_df)

      size  driver_room  property_age
0      360            1             5
1      350            1            10
2      600            1             9
3      375            0             0
4      250            1             5
...    ...          ...           ...
3713   348            0            30
3714   600            1             9
3715   300            1            13
3716  3060            1             0
3717   375            0             0

[3718 rows x 3 columns]


In [None]:
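# Initialize PCA, keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95)

# Fit PCA on the training data. Note: X_train is the unscaled matrix, so
# large-valued columns such as 'size' can dominate the explained variance
X_train_pca = pca.fit_transform(X_train)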

# Transform the test data
X_test_pca = pca.transform(X_test)

print(f"Original number of features: {X_train.shape[1]}")
print(f"Reduced number of features: {X_train_pca.shape[1]}")


Original number of features: 207
Reduced number of features: 1
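A single retained component is what PCA typically produces when one column has far larger variance than the rest. The sketch below illustrates this on synthetic data (an assumption, not the notebook's dataset) by comparing the number of components kept with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: one large-scale column plus four small-scale, independent columns
X_syn = np.column_stack([
    rng.normal(400, 200, 1000),   # large variance, like a size-type column
    rng.normal(0, 1, (1000, 4)),  # unit-variance columns
])

# PCA on raw values: the large-scale column dominates the total variance
pca_raw = PCA(n_components=0.95).fit(X_syn)

# PCA on standardized values: every column contributes equally
pca_scaled = PCA(n_components=0.95).fit(StandardScaler().fit_transform(X_syn))

print("Components kept (raw):   ", pca_raw.n_components_)
print("Components kept (scaled):", pca_scaled.n_components_)
```

Applied to this notebook, the same comparison would mean fitting PCA on `X_train_scaled` instead of `X_train`; the resulting component count and error would then differ from the outputs recorded here.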


In [None]:

# Initialize the model
rf_pca = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the PCA-transformed training data
rf_pca.fit(X_train_pca, y_train)

# Predict on the PCA-transformed test data
y_pred_pca = rf_pca.predict(X_test_pca)


In [None]:
# Calculate MSE
mse_pca = mean_squared_error(y_test, y_pred_pca)
print(f"Mean Squared Error after PCA: {mse_pca}")

Mean Squared Error after PCA: 4492201686.791543
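MSE is expressed in squared units of the target, which makes it hard to read directly. A minimal sketch, using the value printed above, converts it to a root mean squared error in the same units as `price`; the variable names are assumptions.

```python
import numpy as np

# MSE value copied from the output above
mse_reported = 4492201686.791543

# RMSE is in the same units as the target variable
rmse = np.sqrt(mse_reported)
print(f"RMSE after PCA: {rmse:.2f}")
```

Comparing this figure with the typical range of `price` in the dataset gives a rough sense of how large the prediction error is.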
