## INSTRUCTIONS - IMPORTANT:

Every student is expected to submit their own, original solutions for this assignment. While collaborative discussions among classmates are encouraged for better understanding, it is crucial that the work you submit is your own. Copying or replicating someone else's solutions is a breach of academic integrity and will not be tolerated. The use of **AI tools** is also **prohibited** for this assignment.

The dataset used in this assignment is derived from Inside Airbnb, available [here](http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2023-06-05/visualisations/listings.csv). It is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

**Please note that this dataset has been modified and adapted for the scope of this academic assignment. If you are interested in the original data or further Airbnb datasets, we encourage you to visit Inside Airbnb's website: [insideairbnb.com](http://insideairbnb.com).**


<h3> This assignment is divided into two main components: </h3>

1. **Data Manipulation**: Focused on cleaning and preparing the dataset.
2. **Exploratory Data Analysis (EDA)**: Concentrated on analyzing and interpreting the data.

# PART-1 (Data Cleaning and Manipulation)

We will begin by importing the required modules and reading the data file.

In [None]:
import numpy as np
from sklearn import preprocessing
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("airbnb_final.csv")

In [None]:
df.head()

### 1.1. **What is the shape of the dataset?**

In [None]:
# Code goes here


### 1.2. **Identify the data types of each column. Are there any columns that need type conversion?**

In [None]:
# 1.2.1 Code to identify data types goes here.


In [None]:
# 1.2.3 Change the host_id column to an integer type


### 1.3. **Are there any duplicate rows in the dataset? If yes, how would you handle them?**

In [None]:
# 1.3.1 Check for duplicate rows


### 1.4. **Check for missing values. How would you handle the missing values in the dataset?**

In [None]:
# 1.4.1 Check for missing values


In [None]:
# 1.4.3 Populate missing values in the 'price in $' column with the mean.


In [None]:
# 1.4.4 Verify that there are no more missing values in the 'price in $' column


In [None]:
# 1.4.5 Populate all missing values in the 'City' column with "Amsterdam"


In [None]:
# 1.4.6 Verify that there are no more missing values in the 'City' column


In [None]:
# 1.4.7 Drop all remaining rows with missing data. Store the result in a new dataframe called df2.


In [None]:
df2.head()

### 1.5. **Compare the shapes of the original (df) and new (df2) dataframes, and verify there are no missing values in df2.**

In [None]:
# 1.5.1 code to show shapes of old and new dataframes


In [None]:
# 1.5.2 Code to verify no missing values


### 1.6. **Drop the 'latitude' and 'longitude' columns. How does it affect the shape of the dataset?**

Note: from here forward, work with the df2 DataFrame.

In [None]:
# 1.6.1 Code goes here


In [None]:
# 1.6.2 Show new shape


### 1.7. **List the unique values in Apartment_type and Bathroom_type.**

In [None]:
# 1.7.1 Find and list the unique apartment types.


In [None]:
# 1.7.2 Find and list the unique bathroom types.


### 1.8. **Replace the bathroom types (shared and private) with integers (0 and 1).**

In [None]:
# 1.8.1 Replace the strings with integers


In [None]:
# 1.8.2 Verify the changes


# PART-2 (Exploratory Data Analysis)

Exploratory Data Analysis (EDA) is the process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
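As a minimal sketch of how summary statistics can surface an anomaly, the toy example below uses a small made-up DataFrame. The column names `kind` and `value` and all the numbers are hypothetical and unrelated to the assignment dataset.

```python
import pandas as pd

# Toy data invented for illustration; not the assignment dataset.
toy = pd.DataFrame({
    "kind": ["a", "a", "b", "b", "b"],
    "value": [10.0, 12.0, 11.0, 13.0, 250.0],
})

# Summary statistics: a mean far above the median hints at an extreme value.
summary = toy["value"].describe()
print(summary[["mean", "50%", "max"]])

# Group-wise means show which group the extreme value belongs to.
group_means = toy.groupby("kind")["value"].mean()
print(group_means)
```

Here the mean of `value` sits well above its median, which is the kind of signal that would prompt a closer look with a plot.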

### 2.1. **Describe the data.**

In [None]:
# Describe data here.


### 2.2. **Identify significant correlations.**

In [None]:
# 2.2.1 Build the correlation matrix


In [None]:
# 2.2.2 Display the correlation matrix as a heatmap


### 2.3. **Check whether there is any multicollinearity.**
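
For reference, the variance inflation factor (VIF) used in this section is conventionally defined as shown here, where $R_j^2$ is the coefficient of determination obtained by regressing predictor $j$ on all the other predictors. A common rule of thumb (a convention, not a strict threshold) treats values above roughly 5 to 10 as a sign of problematic multicollinearity.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

A predictor that is uncorrelated with the others has $R_j^2 = 0$ and therefore a VIF of 1, while the VIF grows without bound as $R_j^2$ approaches 1.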

In [None]:
# Import the statsmodels tools needed to compute the VIF.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

In [None]:
# 2.3.1 Select all numerical columns *except* host_id.


In [None]:
# 2.3.2 Make sure there are no missing (NaN) values 


In [None]:
# 2.3.3 Add a constant column for the VIF calculation


In [None]:
# 2.3.4 Calculate the VIF for each of the columns and display the information.


### 2.4. **Spot outliers in the dataset.**
Note: We will not remove outliers in this homework.
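
As a hedged illustration of the IQR rule that the later cells in this section refer to, the sketch below flags points lying outside the fences $[Q_1 - m \cdot \mathrm{IQR},\ Q_3 + m \cdot \mathrm{IQR}]$ for a multiplier $m$. The array and the helper name `outside_fences` are invented for this example and are not part of the assignment dataset or its expected solution.

```python
import numpy as np

# Toy values invented for illustration; not the assignment dataset.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 15.0])

# Quartiles and interquartile range of the toy values.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

def outside_fences(data, multiplier):
    """Return the points outside [Q1 - m*IQR, Q3 + m*IQR]."""
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return data[(data < lower) | (data > upper)]

# A larger multiplier widens the fences, so fewer points are flagged.
flagged_standard = outside_fences(values, 1.5)
flagged_wide = outside_fences(values, 3.0)
print(flagged_standard, flagged_wide)
```

With these toy values the point 15.0 is flagged at a multiplier of 1.5 but not at 3, which shows how the choice of multiplier changes what counts as an outlier.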

In [None]:
# 2.4.1 Show a boxplot for 'price in $'


In [None]:
# 2.4.2 Show the maximum value of 'price in $'


In [None]:
# 2.4.3 Show a boxplot for minimum_nights


In [None]:
# 2.4.4 Show a boxplot for Rating


In [None]:
# 2.4.5 Define a function to find outliers beyond a specific multiplier of the IQR.


In [None]:
# 2.4.6 Find outliers for 'price in $', 'minimum_nights', and 'Rating' using the standard multiplier (1.5)


In [None]:
# 2.4.7 Display the count of outliers in each selected column


In [None]:
# 2.4.8 Find outliers for the same columns using a multiplier of 3.


In [None]:
# 2.4.9 Display the new count of outliers in each selected column


### 2.5. **Compute the Average Price for Each Type of Listing.**

In [None]:
# 2.5.1 Group the data by 'room_type' and calculate the average price for each type


In [None]:
# 2.5.2 Plot and display the average price for each room type


### 2.6. **How Many Listings Are There for Each Unique 'Apartment Type'?**

In [None]:
# 2.6.1 Count the frequency of each unique 'Apartment_type'


In [None]:
# 2.6.2 Plot the frequency distribution of 'Apartment_type'


In [None]:
# 2.6.3 Find, plot, and display the top 10 most common apartment types


In [None]:
# 2.6.4 Find, plot, and display the bottom 5 least common apartment types

### 2.7. **Identify the Top 5 Neighbourhoods with the Highest Average Listing Prices.**

In [None]:
# 2.7.1 Find the top 5 most expensive neighbourhoods based on average price


In [None]:
# 2.7.2 Plot and display the top 5 most expensive neighbourhoods
