# **Project Name**    - Predicting House Prices



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 - VISHNU BANSAL(2210992542)**
##### **Team Member 2 - Aditya Sharma(2210992588)**
##### **Team Member 3 - Swayam Sharma(2210992434)**
##### **Team Member 4 - Tanishjot Brar(2210992443)**

# **Project Summary -**

Housing prices are an important economic indicator and being able to accurately predict prices is valuable for buyers, sellers, and real estate professionals. In this project, we develop a regression model to predict housing prices based on property attributes like location, size, number of rooms, and other features. The data consists of sale prices and details for properties sold in the past couple of years.

We start with loading and exploring the raw data to understand the features better. Preprocessing steps like handling missing values, encoding categorical variables, and splitting into train-test sets are then done to prepare data for modeling. Feature engineering can greatly boost model performance so we extract new informative features from existing data. For example, total area per bedroom can indicate if a property is spacious or cramped.
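The area-per-bedroom idea above can be sketched as follows. This is a minimal illustration on a tiny made-up sample, not the notebook's actual feature pipeline; the helper `add_area_per_bedroom` and the column name `area_per_bedroom` are hypothetical, while `area` and `bedrooms` follow the dataset's column names.

```python
import pandas as pd

# Tiny made-up sample; only the column names mirror Housing.csv.
sample = pd.DataFrame({
    "area": [7420, 3000, 6000],
    "bedrooms": [4, 3, 2],
})

def add_area_per_bedroom(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an 'area_per_bedroom' column (hypothetical feature name)."""
    out = frame.copy()
    # Rows with 0 bedrooms become NaN instead of producing a division by zero.
    safe_bedrooms = out["bedrooms"].where(out["bedrooms"] > 0)
    out["area_per_bedroom"] = out["area"] / safe_bedrooms
    return out

features = add_area_per_bedroom(sample)
print(features)
```

A higher value suggests a more spacious layout for the same number of bedrooms.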

With preprocessed data, we train different regression algorithms like linear regression, lasso, ridge, random forest regressor and gradient boosting regressor. Performance metrics like RMSE and R-squared on test data are used to evaluate models. Lasso and ridge regression can generalize better with regularization penalty while ensemble methods like random forest and gradient boosting tend to have high performance with appropriate tuning.
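A rough sketch of such a comparison is shown below. It assumes scikit-learn is installed and uses synthetic data in place of the housing features, so the scores say nothing about the real dataset; the hyperparameter values are arbitrary placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus noise (not the housing data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE is the square root of the mean squared error on the held-out set.
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}

for name, scores in results.items():
    print(f"{name:>18}: RMSE={scores['rmse']:.3f}  R2={scores['r2']:.3f}")
```

Lower RMSE and higher R-squared on the test split indicate a better fit.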

An important part of machine learning is tuning hyperparameters of models through grid search and cross-validation. This helps prevent overfitting and improves generalization capability. The final model is retrained on complete training data and tested on unseen test examples. We also need to diagnose and resolve common issues like high variance or bias if they occur.
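One possible shape for this tuning step is sketched below with scikit-learn's `GridSearchCV` on a ridge model. The data is synthetic and the `alpha` grid is an arbitrary assumption, so treat it as an outline of the workflow rather than the tuned configuration for this project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (stand-in for the preprocessed housing features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(0, 0.5, size=200)

# Candidate regularization strengths; these values are illustrative only.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated RMSE.
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_
print(f"best alpha: {best_alpha}, cross-validated RMSE: {best_rmse:.3f}")
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, which matches the retraining step described above.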

In summary, this project covers the end-to-end machine learning workflow - data preprocessing, feature engineering, model training and evaluation, hyperparameter tuning and final model selection. The goal is to develop an accurate and robust regression model to predict housing prices based on property features. This can serve as a useful reference tool for housing price estimation.

# **GitHub Link -**

https://github.com/SWAYAM-734/AI-ML-PROJECT-/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb

# **Problem Statement**


In the ever-changing landscape of real estate, the need for accurate predictions of future property prices is paramount. Leveraging a comprehensive housing dataset, we aim to develop a predictive model that takes into account crucial factors such as location, size, bedrooms, and amenities. This model will empower real estate stakeholders and potential buyers with valuable insights, facilitating well-informed decisions in a dynamic and competitive housing market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
##For data manipulation
import pandas as pd # pandas for working with dataframes
import numpy as np # numpy for numerical processing

##For data visualization
import matplotlib.pyplot as plt # matplotlib for plotting graphs
import seaborn as sns # seaborn for statistical data visualization

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/Housing.csv')  # df is the name of the DataFrame
# read_csv() from the pandas library loads the CSV data into a pandas DataFrame.
# We pass the file path '/content/Housing.csv' to load the data from that file.

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of rows and columns:", df.shape)

### Dataset Information

In [None]:
# Dataset Info
# df.info() prints its summary directly and returns None, so there is nothing to assign or print
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = df.isnull().sum().reset_index()
print(null_values)

In [None]:
# Visualizing the missing values

plt.figure(figsize = (20,7))
sns.heatmap(df.isnull(), cbar=False)
plt.show()
#Visualizing missing values helps plan appropriate data preprocessing strategies like dropping columns with too many missing values or imputation methods to fill them in


### What did you know about your dataset?





1.   In our dataset, there are 545 rows and 13 columns.
2.   It has no duplicate rows.
3.   It has no missing values.
4.   There are 6 columns of int64 datatype, 7 columns of object datatype.
5.   The heatmap gives a visual representation of the missing values across
     all data points. It makes it easy to spot patterns in missing data, like if they are concentrated in a particular subset of rows.
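The checks behind these observations can be gathered into one small helper. This is a sketch on a three-row made-up sample; `summarize` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Collect shape, duplicate, missing-value, and dtype counts for a DataFrame."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isnull().sum().sum()),
        "dtype_counts": frame.dtypes.astype(str).value_counts().to_dict(),
    }

# Made-up sample in which the last two rows are identical.
sample = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 8960],
    "mainroad": ["yes", "yes", "yes"],
})

summary = summarize(sample)
print(summary)
```

Calling `summarize(df)` on the loaded dataset would report the same quantities listed above in one place.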







## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description


1. `count` gives the number of non-null values in each numeric column.
2. We can also find the mean, standard deviation, minimum, and maximum of each column.
3. We can also see the 25th, 50th, and 75th percentiles of each column.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique().reset_index()
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# The DataFrame remains unchanged because there are no missing values to fill.
# fillna(method="bfill") is deprecated in recent pandas versions; df.bfill() is the equivalent call.
data = df.bfill()
data

In [None]:
# Data preparation
df['area'] = df['area'].astype(int)
df['price'] = df['price'].astype(int)

In [None]:
# 1. Number of data points
print(len(df))

In [None]:
# 2. Average price
print(df['price'].mean())


In [None]:
# 3. Median price
print(df['price'].median())

In [None]:
# 4. Min and max price
print(df['price'].min(),df['price'].max())

In [None]:
# 5. Number of features
print(len(df.columns))

In [None]:
# 6. Features
print(df.columns)

In [None]:
# 7. Feature with most missing values
# Note: when every column has 0 missing values, idxmax() simply returns the first column.
print(df.isnull().sum().idxmax())


In [None]:
# 8. Categories of furnishing status
print(df['furnishingstatus'].unique())

In [None]:
# 9. Percentage with air conditioning
print(round(df[df['airconditioning'] == 'yes'].shape[0] / len(df) * 100, 2))

In [None]:
# 10. Percentage on main road
print(round(df[df['mainroad'] == 'yes'].shape[0] / len(df) * 100, 2))

In [None]:
# 11. Percentage with 4+ bedrooms and 3+ bathrooms
print(round(df[(df['bedrooms']>=4) & (df['bathrooms']>=3)].shape[0] / len(df) * 100, 2))

In [None]:
# 12. Avg price for main road OR parking
print(df[(df['mainroad']=='yes') | (df['parking']>0)]['price'].mean())

In [None]:
# 13. Median area for unfurnished
print(df[df['furnishingstatus']=='unfurnished']['area'].median())

In [None]:
# 14. Max price for air conditioning BUT no basement
print(df[(df['airconditioning']=='yes') & (df['basement']=='no')]['price'].max())

In [None]:

# 15. Average price grouped by number of bedrooms
df.groupby('bedrooms')['price'].mean()


In [None]:
# 16. Average area grouped by furnishing status
df.groupby('furnishingstatus')['area'].mean()

In [None]:
# 17. Maximum price for each number of parking spots
df.groupby('parking')['price'].max()

In [None]:
# 18. Average number of bathrooms grouped by basement
df.groupby('basement')['bathrooms'].mean()

In [None]:
# 19. Median price grouped by air conditioning
df.groupby('airconditioning')['price'].median()

In [None]:
# 20. Average number of stories grouped by preference area
df.groupby('prefarea')['stories'].mean()

In [None]:
# 21. Count of houses grouped by main road
df.groupby('mainroad')['price'].count()

In [None]:
# 22. Minimum price grouped by furnished status
df.groupby('furnishingstatus')['price'].min()

In [None]:
# 23. Average bathrooms grouped by parking spaces
df.groupby('parking')['bathrooms'].mean()

In [None]:
# 24. Maximum price grouped by hot water heating
df.groupby('hotwaterheating')['price'].max()

In [None]:
# 25. Calculate and print the average price for each number of bedrooms.
avg_price_per_bedroom = df.groupby('bedrooms')['price'].mean()
print(avg_price_per_bedroom)

In [None]:
# 26. Count and print the number of houses with hot water heating.
hotwater_count = (df['hotwaterheating'] == 'yes').sum()
print(f"Number of houses with hot water heating: {hotwater_count}")

In [None]:
# 27. Find and print the maximum price along with the corresponding area.
max_price_index = df['price'].idxmax()
max_price_area = df.loc[max_price_index, 'area']
print(f"Maximum Price: {df.loc[max_price_index, 'price']}, Corresponding Area: {max_price_area}")

### What all manipulations have you done and insights you found?



Here are some key insights from the exploratory data analysis:

1.  The dataset has 545 rows and 13 columns: the target `price` plus 12 features.
2.  The average price is 4,823,428 with a median of 4,485,000. Prices range from 1,670,000 to 13,300,000.
3.  No feature has missing values, so `idxmax()` on the null counts just returns the first column.
4.  Most houses (73.4%) are on the main road.
5.  61.96% of houses have air conditioning.
6.  Only 7.41% have 4+ bedrooms and 3+ bathrooms.
7.  Houses on main road or with parking have a higher average price of 5,149,583
8.  Unfurnished houses have a median area of 4,600 sqft.
9.  The max price for houses with AC but no basement is 12,215,000.
10. As number of bedrooms increases, average price also increases.
11. Furnished houses have a higher average area than semi-furnished or unfurnished.
12. Houses with 3 parking spots have the highest max price of 13,300,000.
13. Houses with a basement have more bathrooms on average.
14. Houses with AC have a higher median price than those without.
15. Houses in preferred areas have more stories on average.
16. Most houses are on the main road.
17. Unfurnished houses have the lowest minimum price.

So in summary, factors like location, size, number of bedrooms/bathrooms, furnishing status, AC, parking, etc. are correlated with higher prices. These insights can help guide feature engineering and model development.
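One way to check such relationships numerically is to encode the yes/no columns as 1/0 and look at each feature's correlation with price. The sketch below uses a small made-up sample (only the column names come from the dataset), so the printed values are purely illustrative.

```python
import pandas as pd

# Made-up rows; only the column names mirror Housing.csv.
sample = pd.DataFrame({
    "price": [13300000, 12250000, 9100000, 6650000, 4200000, 3500000],
    "area": [7420, 8960, 6000, 6000, 3500, 3000],
    "airconditioning": ["yes", "yes", "yes", "no", "no", "no"],
})

# Encode the binary categorical column so it can enter a correlation matrix.
encoded = sample.assign(
    airconditioning=sample["airconditioning"].map({"yes": 1, "no": 0})
)

# Pearson correlation of every feature with price, strongest first.
corr_with_price = (
    encoded.corr()["price"].drop("price").sort_values(ascending=False)
)
print(corr_with_price)
```

Correlation only measures linear association, so it is a screening step for feature engineering rather than proof that a feature drives price.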

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Calculate the average area for each number of bedrooms
import matplotlib.ticker as ticker
avg_area_by_bedrooms = df.groupby('bedrooms')['area'].mean().sort_values()

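# Create a horizontal bar plot of average area against number of bedrooms
# orient='h' is set explicitly because both axes are numeric, and seaborn may otherwise draw vertical bars
plt.figure(figsize=(10, 5))
sns.barplot(x=avg_area_by_bedrooms.values, y=avg_area_by_bedrooms.index, orient='h', palette='Set2')  # Using the 'Set2' palette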
plt.title('Average Area by Number of Bedrooms')
plt.xlabel('Average Area (sqft)')
plt.ylabel('Number of Bedrooms')
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '{:.2f}'.format(x)))
plt.show()


##### 1. Why did you pick the specific chart?


I picked the horizontal bar plot because it is a suitable choice for comparing the average area across different categories (in this case, the number of bedrooms).

##### 2. What is/are the insight(s) found from the chart?


The insight from the chart is the average area for each number of bedrooms. By looking at the horizontal bar plot, we can easily compare the average area across different numbers of bedrooms. For example, we can observe whether there's a trend of larger or smaller average areas as the number of bedrooms increases or decreases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The gained insights can potentially lead to positive business impacts. For real estate developers or agents, understanding the relationship between the number of bedrooms and the average area can inform pricing strategies, property development decisions, and marketing efforts.

Regarding negative growth, there might not be insights directly leading to negative growth in this specific analysis. However, if the average area significantly decreases as the number of bedrooms increases, it might indicate that properties with more bedrooms are relatively smaller in size, which could potentially impact their market value negatively compared to larger properties with fewer bedrooms.

#### Chart - 2

In [None]:
# Visualize the distribution of prices in the dataset using a histogram
import matplotlib.ticker as ticker
plt.hist(df['price'], bins=6, color='skyblue', edgecolor='black')
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.gca().xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: '{:,.0f}'.format(x)))
plt.show()


##### 1. Why did you pick the specific chart?


A histogram is chosen because it effectively visualizes the distribution of a continuous variable, in this case, the prices in the dataset. Histograms display the frequency distribution of data within specified intervals (or bins), making it easy to identify the range in which most prices fall and any patterns or outliers present in the data.

##### 2. What is/are the insight(s) found from the chart?


From the histogram, we can gain insights into the distribution of prices in the dataset. We can observe the central tendency (e.g., whether prices are concentrated around a specific value) and the spread of prices (e.g., whether they are evenly distributed or skewed towards certain values). Additionally, we can identify any potential outliers or unusual patterns in the data.
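Skew can also be checked numerically alongside the histogram. The sketch below uses a short made-up series with one large value; on the real data the same calls would be applied to `df['price']`.

```python
import pandas as pd

# Made-up values with a single large observation pulling the right tail.
prices = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20], dtype=float)

skewness = prices.skew()  # sample skewness; > 0 means a longer right tail
mean, median = prices.mean(), prices.median()

print(f"skewness={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
```

A positive skewness together with a mean above the median points to a right-skewed distribution, where a few expensive properties stretch the upper tail.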

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the histogram can help create a positive business impact. Understanding the distribution of prices can inform pricing strategies, marketing campaigns, and product offerings. For example, if the histogram shows that most prices are concentrated within a certain range, a business can set competitive pricing within that range to attract customers. However, if there are outliers indicating unusually high or low prices, it may be necessary to investigate further and adjust pricing strategies accordingly to avoid negative impacts on sales or profitability. Therefore, while outliers may initially seem like negative growth, addressing them appropriately based on the insights gained can ultimately lead to positive business outcomes.

#### Chart - 3

In [None]:
# Create a scatter plot to visualize the correlation between 'area' and 'price'
plt.scatter(df['area'], df['price'], color='orange', alpha=0.7)
plt.title('Correlation between Area and Price')
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


##### 1. Why did you pick the specific chart?


A scatter plot is chosen because it effectively visualizes the relationship between two continuous variables, in this case, 'area' and 'price'. Scatter plots help identify patterns, trends, and potential correlations between the variables. By plotting 'area' on the x-axis and 'price' on the y-axis, we can observe how changes in the area of properties relate to changes in their prices.

##### 2. What is/are the insight(s) found from the chart?


From the scatter plot, we can gain insights into the correlation between the area and price of properties. If there's a positive correlation, we would expect to see points cluster around a diagonal line from the bottom-left to the top-right of the plot, indicating that larger areas tend to have higher prices. Conversely, a negative correlation would show points clustering around a diagonal line from the top-left to the bottom-right, indicating that larger areas tend to have lower prices. Additionally, the scatter plot helps identify any outliers or unusual patterns in the data.
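The direction and strength read off the scatter plot can be quantified with Pearson's correlation coefficient and a fitted straight line. The arrays below are made-up values, not rows from the dataset; with the real data, `df['area']` and `df['price']` would take their place.

```python
import numpy as np

# Made-up (area, price) pairs for illustration only.
area = np.array([3000, 3500, 6000, 6000, 7420, 8960], dtype=float)
price = np.array([3500000, 4200000, 6650000, 9100000, 13300000, 12250000], dtype=float)

# Pearson r: +1 is a perfect increasing line, -1 a perfect decreasing line, 0 no linear trend.
r = float(np.corrcoef(area, price)[0, 1])

# Least-squares line price ~ slope * area + intercept (the "diagonal" seen in the plot).
slope, intercept = np.polyfit(area, price, deg=1)

print(f"Pearson r = {r:.3f}, slope = {slope:.1f} price units per unit of area")
```

A positive `r` and slope correspond to the bottom-left to top-right pattern described above.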

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights gained from the scatter plot can help create a positive business impact. Understanding the correlation between area and price can inform pricing strategies, marketing efforts, and property development decisions. For instance, if there's a strong positive correlation between area and price, businesses can adjust pricing strategies accordingly, targeting higher prices for larger properties. However, if there's no clear correlation or if outliers are present, further analysis may be needed to understand the factors influencing pricing and to avoid negative impacts on sales or profitability. Therefore, while outliers may initially seem like negative growth, addressing them appropriately based on the insights gained can ultimately lead to positive business outcomes.

#### Chart - 4

In [None]:
# Create a box plot to show the distribution of prices based on the number of bedrooms.
# dodge=False keeps each box centered on its tick, since hue and x are the same variable
sns.boxplot(x='bedrooms', y='price', hue='bedrooms', data=df, palette='viridis', dodge=False)
plt.title('Box Plot of Prices based on Bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Price')
plt.legend(title='Bedrooms', loc='upper right')
plt.show()


##### 1. Why did you pick the specific chart?




A box plot is a good choice for this scenario because it effectively visualizes the distribution of prices for each bedroom category. It displays the following key information:

*   Center: The median price for each bedroom category, indicating the "typical" price.
*   Spread: The interquartile range (IQR) for each category, representing the middle 50% of the data and highlighting potential variations in prices.
*   Outliers: Any data points more than 1.5 × IQR below the first quartile or above the third quartile, signifying potential extreme values.

This comprehensive view allows for comparisons between different bedroom categories and helps identify trends or patterns in pricing based on the number of bedrooms.
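The outlier rule used by the box plot can be reproduced directly in pandas. The sketch below runs on made-up prices, and `iqr_outliers` is a hypothetical helper name; it flags values more than 1.5 × IQR beyond the quartiles within each bedroom group.

```python
import pandas as pd

# Made-up prices: the 2-bedroom group contains one extreme value.
sample = pd.DataFrame({
    "bedrooms": [2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "price": [100, 110, 120, 130, 500, 200, 210, 220, 230, 240],
})

def iqr_outliers(prices: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    mask = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return prices[mask]

# Apply the rule separately to each bedroom category.
outliers = {
    int(beds): iqr_outliers(group).tolist()
    for beds, group in sample.groupby("bedrooms")["price"]
}
print(outliers)  # {2: [500], 3: []}
```

Listing the flagged rows makes it easier to inspect the properties behind the points drawn beyond the whiskers.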

##### 2. What is/are the insight(s) found from the chart?


By analyzing the box plot, we can gain valuable insights such as:

*   Price trends: Whether the median price generally increases, decreases, or remains stable as the number of bedrooms increases.
*   Price variation: Whether the IQR (spread) of prices is similar or considerably different across various bedroom categories, indicating how much prices might fluctuate within each category.
*   Outliers: The presence of any outliers could suggest exceptional properties with significantly higher or lower prices compared to their respective bedroom category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


*   Inform pricing strategies: By understanding the relationship between bedrooms and pricing, businesses can make informed decisions about setting competitive and profitable prices for properties with different numbers of bedrooms.
*   Target specific markets: Identify customer segments interested in specific price ranges and bedroom configurations, allowing for targeted marketing and advertising efforts.
*   Improve resource allocation: Knowing which bedroom categories have higher price variations or potential outliers might prompt further investigation or adjustments in resource allocation for property valuation or marketing.

#### Chart - 5

In [None]:
# Create a bar chart to show the count of houses with and without air conditioning
df['airconditioning'].value_counts().plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.title('Count of Houses with Air Conditioning')
plt.xlabel('Air Conditioning')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()


##### 1. Why did you pick the specific chart?



A bar chart is a suitable choice for this scenario because it effectively visualizes the categorical data present in the 'airconditioning' column. This column likely contains two distinct values, such as "yes" and "no," representing whether a house has air conditioning or not.

Bar charts excel at displaying the frequency or count of each category, making them ideal for understanding the distribution of the data in this case.

##### 2. What is/are the insight(s) found from the chart?


By analyzing the bar chart, you can gain valuable insights such as:

*   Prevalence of air conditioning: Whether the majority of houses in the dataset have air conditioning or not. This information can be crucial for understanding the market expectation for air conditioning in this specific location.
*   Distribution of houses: The relative proportions of houses with and without air conditioning. This can be helpful in identifying potential demand and supply dynamics related to air conditioning in the market.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights from the bar chart can potentially lead to positive business impacts in various ways:

*   Real estate agents: Understanding the prevalence of air conditioning can help tailor property listings and marketing strategies to better cater to buyer or renter preferences. For instance, highlighting the presence of air conditioning in listings for locations where it's highly desired could be beneficial.
*   Investors: Knowing the market expectation for air conditioning can influence investment decisions. They might consider factors like the cost of installing or upgrading air conditioning systems and its potential impact on rental income or property value.

#### Chart - 6

In [None]:
# Visualize the distribution of furnishing statuses using a pie (donut) chart


# Get value counts of furnishing statuses
status_counts = df['furnishingstatus'].value_counts()

# Create pie chart
plt.figure(figsize=(8, 8))
plt.pie(status_counts, labels=status_counts.index, autopct='%1.1f%%', colors=['lightgreen', 'lightblue', 'lightcoral'], wedgeprops=dict(width=0.3))

# Draw a white circle at the center to create a donut chart
centre_circle = plt.Circle((0, 0), 0.2, color='white', linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Add title and remove y-label
plt.title('Distribution of Furnishing Status')
plt.ylabel('')

# Show plot
plt.show()



##### 1. Why did you pick the specific chart?


A pie chart is a suitable choice for visualizing the distribution of furnishing status because it represents categorical data with mutually exclusive categories. This means that each house can only belong to one furnishing category (furnished, semi-furnished, or unfurnished).

Pie charts effectively showcase the proportions of each category relative to the whole, making them ideal for understanding the dominant furnishing status and the relative prevalence of other options in the dataset.

##### 2. What is/are the insight(s) found from the chart?


By analyzing the pie chart, you can gain valuable insights such as:

Dominant furnishing status: Identify the most common furnishing type among the houses in the dataset. This information can be helpful for understanding current market trends and potential buyer or renter preferences.
Availability of other options: Observe the relative proportions of houses in each furnishing category (furnished, semi-furnished, unfurnished). This can be useful for assessing the variety of options available in the market and identifying potential niches or underserved segments.

##### 3. Will the gained insights help creating a positive business impact?
The insights from the pie chart can potentially lead to positive business impacts in various ways:

*   Real estate agents: Understanding the dominant furnishing status can help tailor marketing strategies to target specific buyer or renter segments. For example, emphasizing the availability of furnished options could attract individuals seeking move-in-ready solutions.
*   Investors: Knowing the distribution of furnishing statuses can influence investment decisions. They might consider factors like the cost of furnishing properties and the potential rental income associated with different furnishing options.

Are there any insights that lead to negative growth? Justify with specific reason.

The pie chart itself doesn't directly indicate negative impacts. However, depending on the specific business goals, certain insights might require further analysis to avoid potential drawbacks:

If the pie chart reveals a very limited variety of furnishing options, a business focusing solely on offering fully furnished properties might face challenges finding suitable properties to invest in. Diversifying their investment strategy or considering partnerships with furnishing companies could be necessary.

#### Chart - 7

In [None]:
# Create a stacked bar chart showing the distribution of houses across combinations of air conditioning, parking, and furnishing status
df.groupby(['airconditioning','parking','furnishingstatus']).size().unstack().plot(kind='bar', stacked=True)
plt.title('Count of Houses by Air Conditioning, Parking, and Furnishing Status')
plt.xlabel('(Air Conditioning, Parking)')  # x-axis groups; furnishing status is shown as the stacked segments
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

This visualization uses a stacked bar chart because it effectively displays the distribution of houses across combinations of three variables: air conditioning, number of parking spaces, and furnishing status.

A regular bar chart might become overwhelming with so many categories.
Stacking the bars allows viewers to see the breakdown of furnishing-status counts within each combination of air conditioning and parking.
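The counts behind such a stacked chart can be inspected as a table with `pd.crosstab`. This is a sketch on five made-up rows, with only the column names taken from the dataset.

```python
import pandas as pd

# Made-up rows; only the column names mirror Housing.csv.
sample = pd.DataFrame({
    "airconditioning": ["yes", "yes", "no", "no", "no"],
    "parking": [2, 2, 0, 0, 1],
    "furnishingstatus": [
        "furnished", "semi-furnished", "unfurnished", "unfurnished", "furnished",
    ],
})

# Rows: (airconditioning, parking) pairs; columns: furnishing status; cells: house counts.
counts = pd.crosstab(
    index=[sample["airconditioning"], sample["parking"]],
    columns=sample["furnishingstatus"],
)
print(counts)
```

Each row of the table corresponds to one bar, and each column to one stacked segment; combinations that never occur show up as 0.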

##### 2. What is/are the insight(s) found from the chart?

*   Identify popular combinations: Observe which combinations of air conditioning, parking, and furnishing status have the highest counts. This can reveal preferences in the market for specific features.
*   Compare across categories: See how the distribution of counts varies for different air conditioning options. This might indicate how parking and furnishing preferences change based on the presence or absence of air conditioning.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:

*   Targeted marketing: By identifying popular combinations of features (e.g., air conditioning, parking, furnishing), businesses can tailor their marketing efforts to reach specific customer segments with relevant messages and offerings.
*   Informed pricing: Understanding how these features influence price distribution (through further analysis) can help businesses set competitive and profitable prices for different property types.
*   Improved resource allocation: Knowing which combinations are in higher demand can help businesses allocate resources more efficiently towards properties with those features, potentially leading to faster sales or rentals.

Negative impact:

*   Overlooking niche markets: Focusing solely on popular combinations might lead to overlooking niche markets with specific needs. Businesses should consider the overall market size and potential profitability of less frequent combinations.
*   Misinterpretation of causality: Observing correlations between features and price doesn't necessarily imply causation. Businesses should avoid making pricing decisions solely based on correlations without further analysis to understand underlying factors influencing price.

#### Chart - 8

In [None]:
# Create a KDE plot of price (x-axis) against estimated density (y-axis)
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(df['price'], color="blue", fill=True)
plt.title('KDE Plot of Price')
plt.xlabel('Price')
plt.ylabel('Density')
plt.show()


##### 1. Why did you pick the specific chart?

A KDE plot is a good choice for this scenario because it effectively visualizes the distribution of the 'price' variable, which is likely continuous and numerical.

Unlike histograms, KDE plots create a smoother representation of the data density, helping identify potential patterns and trends.

##### 2. What is/are the insight(s) found from the chart?

*   Overall price distribution: Observe the shape of the density curve. Is it skewed towards higher or lower prices? Does it have a single peak or multiple peaks? This can indicate potential central tendencies and spread of prices.
*   Identify potential outliers: Look for a long thin tail or a small separate bump far away from the main concentration of the data. These could represent outliers or extreme values in terms of price.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact:

*   Identifying price trends: The overall shape of the density curve can reveal price trends in the market (e.g., if prices are concentrated towards a specific range or show a significant skew). This can inform investment strategies and pricing decisions.
*   Detecting outliers: Identifying potential outliers through the KDE plot can prompt further investigation into these properties. These outliers might represent unique selling points or require adjustments in pricing or marketing strategies.

Negative impact:

*   Misinterpretation of outliers: Not all outliers are necessarily negative. Businesses should avoid automatically discounting outliers without understanding the reasons behind their price deviation. Some outliers might be desirable properties with unique features that could attract specific buyers at a premium price.
*   Limited information: The KDE plot alone doesn't provide the full picture. Businesses should combine insights from the KDE plot with other data points (e.g., property characteristics, location) to make informed decisions.

#### Chart - 9

In [None]:
# Create a histogram of price with a KDE curve overlaid
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=20, color='lightblue', kde=True)
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

This visualization combines a histogram and a kernel density estimate (KDE) plot.

The histogram effectively displays the frequency distribution of the 'price' variable, revealing the number of houses within specific price ranges (bins).
The KDE plot smooths out the data, providing a continuous density curve and helping visualize the overall shape of the distribution (e.g., normal, skewed).

##### 2. What is/are the insight(s) found from the chart?

Central tendency: Observe the peak of the KDE plot, which indicates the most common price range. This can provide clues about the typical price point in the market.
Spread and shape: Analyze the spread and shape of the distribution. Is the distribution symmetrical (normal) or skewed towards higher or lower prices? This information can be crucial for understanding the range of prices and potential outliers.
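
The visual impression of skew can be backed by a number. As a minimal sketch on made-up prices (an assumption for illustration, not the project data), pandas' `Series.skew()` returns a positive value for a long right tail and a negative value for a long left tail.

```python
import pandas as pd

# Hypothetical prices for illustration only; not taken from the project dataset.
prices = pd.Series([30, 35, 40, 42, 45, 47, 50, 52, 55, 200], name="price")

# Sample skewness: > 0 suggests a right tail (a few expensive houses).
skewness = prices.skew()

# A mean well above the median points the same way.
mean_price, median_price = prices.mean(), prices.median()
print(f"skew={skewness:.2f}, mean={mean_price:.1f}, median={median_price:.1f}")
```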

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact: Knowing which price range contains the most houses helps position listings and marketing around the typical price point in the market.
Negative impact: The bar heights count houses in each price bin, not how often purchases happen, so a low bar at high prices should not be read as evidence that raising prices reduces sales. Pricing decisions based on the histogram alone could therefore be misleading.

#### Chart - 10

In [None]:
# Create a count plot of houses by number of bedrooms, split by furnishing status
sns.countplot(x='bedrooms', hue='furnishingstatus', data=df)
plt.title('Count of Houses by Bedrooms and Furnishing Status')
plt.xlabel('Number of Bedrooms')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

This chart is chosen because it effectively visualizes categorical data with two variables (bedrooms and furnishing status). It uses separate bars for each combination of categories, allowing for easy comparison of counts across different combinations.

##### 2. What is/are the insight(s) found from the chart?

Identify popular combinations: Observe which combinations of bedrooms and furnishing status have the highest counts. This reveals the most prevalent type of property in terms of these two features.
Compare across categories: See how the distribution of counts varies for different numbers of bedrooms. This might indicate how furnishing preferences change based on the number of bedrooms.
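
The counts behind the bars can be read off directly with a cross-tabulation. The sketch below uses a tiny made-up frame that borrows the `bedrooms` and `furnishingstatus` column names from the chart code; the rows themselves are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration; only the column names come from the notebook.
toy = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 2, 3],
    "furnishingstatus": [
        "furnished", "semi-furnished", "unfurnished",
        "furnished", "unfurnished", "semi-furnished",
    ],
})

# Rows are bedroom counts, columns are furnishing statuses, cells are house counts.
counts = pd.crosstab(toy["bedrooms"], toy["furnishingstatus"])
print(counts)
```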


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Tailored marketing: Understanding popular combinations can help target specific customer segments with relevant messages and offerings based on their preferences for bedrooms and furnishing.

#### Chart - 11

In [None]:
# Create a line plot of average price against number of bedrooms
# Calculate the average price for each number of bedrooms
avg_price_by_bedrooms = df.groupby('bedrooms')['price'].mean()
# Create line plot
plt.figure(figsize=(10, 6))
sns.lineplot(x=avg_price_by_bedrooms.index, y=avg_price_by_bedrooms.values, marker='o', color='blue')
plt.title('Average Price by Number of Bedrooms')
plt.xlabel('Number of Bedrooms')
plt.ylabel('Average Price')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A line plot is appropriate here because it effectively visualizes the trend in average price as the number of bedrooms increases. It helps identify any linear relationship or patterns between these variables.

##### 2. What is/are the insight(s) found from the chart?

Observe the trend: See if the average price increases, decreases, or remains stable with more bedrooms. This can inform pricing strategies for different property types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact: Knowing the price trend allows businesses to set competitive prices based on bedroom configuration, potentially leading to faster sales or rentals.
Negative impact: Relying solely on this trend might neglect other factors influencing price (e.g., location, property condition). Businesses should consider a holistic approach when setting prices.

#### Chart - 12

In [None]:
# Create a violin plot of price by furnishing status
plt.figure(figsize=(10, 6))
sns.violinplot(x='furnishingstatus', y='price', data=df)
plt.title('Violin Plot of Price by Furnishing Status')
plt.xlabel('Furnishing Status')
plt.ylabel('Price')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A violin plot is suitable here because it displays the full distribution of price for each furnishing status (furnished, semi-furnished, unfurnished). The width of each violin shows how densely houses are concentrated at a given price, which a simple bar of averages would hide.

##### 2. What is/are the insight(s) found from the chart?

Compare price distributions: Observe if the medians and spreads of prices differ significantly across furnishing statuses. This can reveal potential price differences associated with each furnishing option.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact: Understanding the price variations can help businesses make informed decisions regarding investment strategies and rental pricing based on furnishing status.
Negative impact: Misinterpreting the violin plot could lead to overgeneralization. Businesses should avoid assuming all properties within a furnishing category have the same price. Further analysis might be needed for specific properties.

#### Chart - 13

In [None]:
# Create a grid of count plots for the categorical features in df
plt.figure(figsize=(12, 8))

# Plot for 'hotwaterheating'
plt.subplot(2, 3, 1)
sns.countplot(x='hotwaterheating', data=df)
plt.title('Hot Water Heating')

# Plot for 'airconditioning'
plt.subplot(2, 3, 2)
sns.countplot(x='airconditioning', data=df)
plt.title('Air Conditioning')

# Plot for 'parking'
plt.subplot(2, 3, 3)
sns.countplot(x='parking', data=df)
plt.title('Parking')

# Plot for 'prefarea'
plt.subplot(2, 3, 4)
sns.countplot(x='prefarea', data=df)
plt.title('Preferred Area')

# Plot for 'furnishingstatus'
plt.subplot(2, 3, 5)
sns.countplot(x='furnishingstatus', data=df)
plt.title('Furnishing Status')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This visualization uses a grid of subplots, each a count plot of one categorical variable, so the category frequencies of several features can be compared side by side in a single figure.

##### 2. What is/are the insight(s) found from the chart?

Observe individual feature distributions: Each subplot provides insights into the prevalence of each category for specific features (hot water heating, air conditioning, parking, preferred area, furnishing status).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identify dominant categories: Knowing the most frequent categories for each feature can inform marketing strategies and investment decisions. For example, focusing on highlighting the presence of desired features (e.g., air conditioning in a hot climate) in property listings.
Potential limitations: Analyzing each feature individually might miss potential interactions between them. Further analysis might be needed to understand how combinations of features influence aspects like price or demand.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
correlation_matrix = df.corr(numeric_only=True)

# Plotting the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

Heatmaps effectively visualize correlations between multiple numerical variables simultaneously.
Color-coding and annotations make it easy to identify patterns and strengths of relationships.

##### 2. What is/are the insight(s) found from the chart?

Strong positive correlations: Variables with values closer to 1 (red) tend to move together.
Strong negative correlations: Variables with values closer to -1 (blue) tend to move in opposite directions.
Weak correlations: Values near 0 (white) indicate little or no relationship.
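
For modeling, the most useful slice of the heatmap is usually the `price` column. As a rough sketch on made-up numbers (the `area` column name and all values are assumptions for illustration), the correlations with `price` can be ranked by absolute strength:

```python
import pandas as pd

# Hypothetical values for illustration; 'area' is an assumed column name.
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "area": [50, 70, 95, 120, 150],
    "parking": [2, 0, 1, 0, 1],
})

# Correlation of every numeric feature with price, strongest first.
corr_with_price = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Correlation only captures linear association, so a weak value does not rule out a non-linear relationship.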

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df, hue='furnishingstatus', palette='husl', markers=["o", "s", "D"])
plt.suptitle('Scatter Plot Matrix', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Pair plots visualize relationships between multiple numerical variables in an organized grid of scatter plots.
They allow for simultaneous comparison of pairs and visual assessment of correlations.
It's particularly helpful for exploring potential patterns and identifying interesting relationships that might not be apparent in a correlation matrix.

##### 2. What is/are the insight(s) found from the chart?

Linear relationships: Observe pairs of variables that exhibit clear linear trends, suggesting potential correlations.
Non-linear relationships: Identify pairs with non-linear patterns, indicating more complex relationships.
Clustering: Notice natural groupings or clusters within the data, potentially suggesting underlying patterns or subgroups.
Furnishing status effects: Explore how relationships between variables might differ for different furnishing statuses (using the hue argument).

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing through code and statistical tests to reach a final conclusion about each statement.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

##### Why did you choose the specific statistical test?
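
The hypotheses themselves are left open above. Purely as a hypothetical example, suppose the statement is "houses with air conditioning have a different mean price than houses without". A two-sample Welch t-test is one reasonable choice for comparing two group means without assuming equal variances. The sketch below runs it on simulated prices, which are assumptions and not results from the dataset.

```python
import numpy as np
from scipy import stats

# Simulated prices for two hypothetical groups; not results from the dataset.
rng = np.random.default_rng(0)
price_ac = rng.normal(loc=60, scale=10, size=40)      # with air conditioning
price_no_ac = rng.normal(loc=45, scale=10, size=60)   # without air conditioning

# H0: the two group means are equal. H1: they differ.
# equal_var=False selects Welch's t-test.
t_stat, p_value = stats.ttest_ind(price_ac, price_no_ac, equal_var=False)

alpha = 0.05  # conventional significance level, chosen here as an assumption
reject_null = p_value < alpha
print(f"t={t_stat:.2f}, p={p_value:.4g}, reject H0: {reject_null}")
```

On the real data the two arrays would be the `price` values filtered by `airconditioning`.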

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

##### Why did you choose the specific statistical test?

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

##### Why did you choose the specific statistical test?

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?
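
One common approach, sketched here on a made-up frame, is to count missing values first and then impute numeric columns with the median and categorical columns with the mode. The `area` column name and all values are assumptions; whether the project dataset has any missing values at all needs to be checked on the real `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing entries; 'area' is an assumed column.
toy = pd.DataFrame({
    "area": [50.0, np.nan, 95.0, 120.0],
    "furnishingstatus": ["furnished", None, "furnished", "unfurnished"],
})

# Step 1: count missing values per column.
missing_counts = toy.isna().sum()

# Step 2: median for numeric columns, most frequent value for categorical ones.
imputed = toy.copy()
imputed["area"] = imputed["area"].fillna(imputed["area"].median())
imputed["furnishingstatus"] = imputed["furnishingstatus"].fillna(
    imputed["furnishingstatus"].mode()[0]
)
print(missing_counts)
print(imputed)
```

The median is less sensitive to extreme prices or areas than the mean, which is why it is often preferred for skewed housing data.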

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?
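
A possible encoding scheme is sketched below: binary columns are mapped to 0/1 and the three-level `furnishingstatus` is one-hot encoded. The `yes`/`no` values are an assumption about how the binary columns are stored, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; the 'yes'/'no' coding of the binary columns is an assumption.
toy = pd.DataFrame({
    "airconditioning": ["yes", "no", "yes"],
    "prefarea": ["no", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

encoded = toy.copy()

# Binary columns: map yes/no to 1/0.
for col in ["airconditioning", "prefarea"]:
    encoded[col] = encoded[col].map({"yes": 1, "no": 0})

# Multi-level column: one-hot encode, dropping the first level to avoid
# perfectly collinear dummy columns in linear models.
encoded = pd.get_dummies(
    encoded, columns=["furnishingstatus"], drop_first=True, dtype=int
)
print(encoded)
```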

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
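
The project summary mentions area per bedroom as an example of an engineered feature. A minimal sketch of that ratio is shown below; the `area` column name and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; 'area' is an assumed column name.
toy = pd.DataFrame({"area": [60.0, 90.0, 120.0], "bedrooms": [2, 3, 2]})

# Larger values suggest a more spacious property for its bedroom count.
toy["area_per_bedroom"] = toy["area"] / toy["bedrooms"]
print(toy)
```

If any row could have zero bedrooms, the division would need guarding before use on real data.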

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used and why?

##### Which features did you find important and why?

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?
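
If the price distribution turns out to be right-skewed, a log transform of the target is one common option. The sketch below, on made-up prices, applies `np.log1p` and shows that `np.expm1` reverses it so predictions can be mapped back to the original scale.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration only; not taken from the project dataset.
prices = pd.Series([30, 35, 40, 42, 45, 47, 50, 52, 55, 200], name="price")

# log1p compresses the long right tail; expm1 is its exact inverse.
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

print(f"skew before: {prices.skew():.2f}, after: {log_prices.skew():.2f}")
```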

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?
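
Standardization is one option, and it matters mainly for penalized linear models such as lasso and ridge. The sketch below uses scikit-learn's `StandardScaler` on made-up feature values and fits the scaler on the training rows only, so that no information from the test rows leaks into the scaling statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature rows (e.g. area, bedrooms); values are made up.
X_train = np.array([[50.0, 2], [70.0, 3], [95.0, 3], [120.0, 4]])
X_test = np.array([[80.0, 3]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```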

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?
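
An 80/20 split is a common default and is used here only as an assumption. The sketch below shows the call on placeholder arrays; fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target; stand-ins for the prepared X and y.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 holds out 20% of the rows for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```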

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.
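
The project summary names linear regression as one candidate model, with RMSE and R-squared as metrics. The sketch below shows that workflow end to end on simulated data; the features, coefficients, and noise level are all assumptions, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Simulated data: price depends linearly on area and bedrooms plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(40, 200, size=200)
bedrooms = rng.integers(1, 6, size=200)
price = 1.5 * area + 10 * bedrooms + rng.normal(0, 5, size=200)

X = np.column_stack([area, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Fit on the training split, evaluate on the held-out split.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # same units as price
r2 = r2_score(y_test, y_pred)                       # share of variance explained
print(f"RMSE={rmse:.2f}, R2={r2:.3f}")
```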

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?
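
Grid search with cross-validation, mentioned in the project summary, could look like the sketch below. It tunes the `alpha` penalty of a ridge regressor with 5-fold cross-validation on simulated data; the grid values and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Simulated regression data; not the project dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 0.5, size=120)

# Candidate regularization strengths (an assumed grid).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
search = GridSearchCV(
    Ridge(), param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(f"best alpha={best_alpha}, CV RMSE={best_rmse:.3f}")
```

Grid search is exhaustive and therefore practical only for small grids; randomized search is a common alternative when the search space grows.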

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.
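
A save-and-reload round trip might look like the sketch below. It uses the standard-library `pickle` module with a small stand-in model; the file name `house_price_model.pkl` and the training values are hypothetical, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on made-up values (price = 2 * area).
X = np.array([[50.0], [70.0], [95.0], [120.0]])
y = np.array([100.0, 140.0, 190.0, 240.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name for the saved model.
    model_path = os.path.join(tmp_dir, "house_price_model.pkl")

    # Save the fitted model to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back as a separate object.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# Sanity check: both objects give the same prediction on an unseen row.
X_new = np.array([[80.0]])
same_prediction = np.allclose(model.predict(X_new), loaded_model.predict(X_new))
print(same_prediction)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.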


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***