## *DATA 3300*

# **Final Project - Unit 1: Data Preparation**

## Final Project Description

The Final Project in this course is broken into three units, corresponding to the three units in the course. By the end of the course, each student will have completed a comprehensive final project covering data preparation, data understanding, and data modeling. The final portion of the project will include an executive summary of the completed project.

## Introduction

For this final project, we will take on the role of consultants for Aggie Investments, a real estate investment firm. In recent years, there has been a significant trend among investment firms to acquire properties for use as rental assets. While various geographies have been proposed, our focus is to assess the opportunities within a specific, rapidly growing market: Nashville, Tennessee.

Our task is to analyze a provided dataset containing information on current Airbnb listings in the Nashville area. The objective is to explore the data comprehensively and provide informed recommendations to Aggie Investments regarding the potential of entering this market, the types of listings they should acquire, and how they should manage those listings. The project will involve data preparation, exploration, and the application of unsupervised machine learning models to uncover deeper insights and patterns within the data. These findings will guide our final recommendations to the firm.


## **Part 1: Data Types**

**1 - Import the data3300_airbnb_data_raw_nashville.csv dataset into Python and explore it to make sure we understand the data types present in the data.**

REMEMBER THE CODE CHEAT SHEET!


In [None]:
# replace with code to import the libraries and packages required to import data, manipulate dataframes, and produce plots (visualize data)

import warnings
warnings.filterwarnings("ignore")

In [None]:
# replace with code to import dataset
# replace with code to change display_option to display max columns in the dataframe

In [None]:
# replace with code to preview the df.
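
As a rough illustration of the import, display-option, and preview steps, the sketch below reads a tiny in-memory CSV rather than the real file so that it runs on its own. The rows and the `demo_df` name are made up for the example; in the notebook, the file path would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV (made-up rows) so the sketch is
# self-contained; in the notebook, pass the file path to pd.read_csv instead.
csv_text = """id,host_since,host_is_superhost,price,accommodates
101,2015-06-01,t,"$120.00",4
102,2019-01-15,f,"$85.50",2
"""

demo_df = pd.read_csv(io.StringIO(csv_text))

# Show every column when previewing a wide dataframe.
pd.set_option("display.max_columns", None)

print(demo_df.head())   # preview the first rows
print(demo_df.dtypes)   # data type pandas inferred for each column
```

Note that `price` is read as text here because of the currency symbol, which is why a later step converts it to a float.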

## **Variable And Data Types**
**2 - For each of the variables listed below, identify both the data type and the variable type.**

*The field `id` has already been filled in to provide an example.*

* **id**
  * **Data Type:** float64
  * **Variable Type:** Discrete numerical
* **host_since**
  * **Data Type:** [insert answer]
  * **Variable Type:** [insert answer]
* **host_is_superhost**
  * **Data Type:** [insert answer]
  * **Variable Type:** [insert answer]
* **availability_365**
  * **Data Type:** [insert answer]
  * **Variable Type:** [insert answer]
* **accommodates**
  * **Data Type:** [insert answer]
  * **Variable Type:** [insert answer]
* **price**
  * **Data Type:** [insert answer]
  * **Variable Type:** [insert answer]
* **reviews_per_month**
  * **Data Type:** [insert answer]
  * **Variable Type:** [insert answer]


## Additional Questions to Answer:

* 3) What is the primary key in our dataset? What is the function of the primary key?
  * Answer:
* 4) What is the difference between a continuous and discrete variable? List examples of each in the dataset.
  * Answer:
* 5) What types of variables are considered quantitative (Numerical)? List examples in the dataset.
  * Answer:

**6 - Create a bar chart of a qualitative variable where the descriptive statistic displayed is the count. What does this show us?**

In [None]:
# replace with code to create a bar chart

[replace with answer - what does this show us?]
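
One possible way to get the counts behind such a bar chart is `value_counts`, sketched here on a made-up `rooms` dataframe (the values are illustrative, not taken from the dataset). The plotting call is left as a comment because it requires matplotlib.

```python
import pandas as pd

# Made-up categories for illustration only.
rooms = pd.DataFrame(
    {"room_type": ["Entire home/apt", "Private room",
                   "Entire home/apt", "Entire home/apt"]}
)

# Count how many rows fall into each category.
room_counts = rooms["room_type"].value_counts()
print(room_counts)

# With matplotlib installed, room_counts.plot(kind="bar") draws the chart.
```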

**7 - Create a boxplot of a quantitative variable. What does this boxplot tell us about the variable?**


In [None]:
# replace with code to create boxplot

[replace with answer - what does this show us?]

**8 - Create a scatterplot of 2 continuous variables. What do we learn from this plot?**


In [None]:
# replace with code to create scatterplot

[replace with answer - what does this show us?]

## **Part 2: Data Sources**


**9 - Import the air_quality_dataframe.csv dataset into Python, and join this dataset with our listings dataset. You haven't joined two datasets in this class, so this template will help you!**

Our business partners at Aggie Investments believe that adding the average air quality for each listing could add value to our analysis. We created a new dataset called `air_quality_dataframe.csv` using the OpenWeather Air Quality API. This dataframe has a listing `id` field that corresponds to the listings dataset, as well as the average air quality for that listing.



In [None]:
# replace with code to import the airquality dataset, name dataframe aq_df

In [None]:
# replace with code to join the airquality dataset with our listings dataframe
df = df.merge(aq_df, on='id', how='left')
df.head()
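
To see what the left join does, here is a small sketch on two made-up dataframes. The column name `avg_air_quality` and the values are assumptions for illustration. The optional `validate` and `indicator` arguments of `merge` are one way to check that `id` is unique on both sides and to see which listings found no match.

```python
import pandas as pd

# Made-up frames; "avg_air_quality" is a hypothetical column name.
listings = pd.DataFrame({"id": [1, 2, 3], "price": [100.0, 150.0, 90.0]})
air = pd.DataFrame({"id": [1, 2], "avg_air_quality": [2.0, 3.5]})

# A left join keeps every listing; listings with no air-quality row get NaN.
merged = listings.merge(
    air, on="id", how="left",
    validate="one_to_one",  # raise an error if id is duplicated on either side
    indicator=True,         # add a _merge column showing where each row came from
)

print(merged)
print(merged["_merge"].value_counts())
```

A left join is used so that no listing is lost when its air-quality record is missing.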

## Questions to Answer:

* 10) Discuss the ethical considerations for our entire dataset (both Airbnb listings and air quality) -- Consider things like whether there is any personally identifiable information (PII) in our dataset, bias inherent in the sample of our data, ethical considerations of the impact of this task/analysis, etc.
  * Answer:
* 11) Are our data obtained from Primary or Secondary data sources?
  * Answer:


## **Part 3: Data Cleaning**


Clean and transform our Airbnb listing data set. If you need some reminders about how to do this, revisit the data cleaning module!

**12 - Think about any ethical concerns regarding this dataset. Remove any columns that personally identify hosts.**

In [None]:
# replace with code to remove any columns that personally identify hosts
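
Dropping columns can be sketched as follows on a made-up dataframe. Treating `host_name` as the identifying column is an assumption made for the example; which columns actually identify hosts is for you to decide and justify.

```python
import pandas as pd

# Made-up rows for illustration only.
listings = pd.DataFrame(
    {"id": [1, 2], "host_name": ["Alex", "Sam"], "price": [100.0, 80.0]}
)

# Assumption for this sketch: host_name is the identifying column.
pii_cols = ["host_name"]
listings = listings.drop(columns=pii_cols)

print(listings.columns.tolist())
```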

**13 - Go through each attribute column and perform various data transformations necessary to cleanse the dataset. For each attribute/column, report each data cleansing step performed and the underlying assumption as to why the data cleansing action was performed.**
  * Do not simply state that “all columns were trimmed” or restate the cleansing action itself.
  * State the assumption (e.g., “M” was changed to “Male” because it was assumed that “M” indicated “Male” in this dataset).
  * Also, if no data transformations were made, state your assumption here as well (all data were assumed to be correct/clean).
  * **WE WILL ADDRESS MISSING DATA IN UNIT 2, do NOT fill in or drop missing data** unless specifically instructed to.

### **Data Transformations & Assumptions**

* **id**
  * **Action** none
  * **Assumption** all values are correct
* **name**
  * **Action**
  * **Assumption**
* **host_id**
  * **Action**
  * **Assumption**
* **host_name**
  * **Action**
  * **Assumption**
* **host_since**
  * **Action** create new column called `days_as_host` and drop this column
  * **Assumption**
* **host_is_superhost**
  * **Action**
  * **Assumption**
* **host_has_profile_pic**
  * **Action**
  * **Assumption**
* **host_identity_verified**
  * **Action**
  * **Assumption**
* **neighbourhood_group_cleansed**
  * **Action**
  * **Assumption**
* **room_type**
  * **Action**
  * **Assumption**
* **bathrooms_text**
  * **Action** Create new column called `bathrooms` and drop this column
  * **Assumption**
* **price**
  * **Action** Convert to float
  * **Assumption**

## There are a few specific transformations you will need to complete as well.

**14 - Create a new `days_as_host` column using the following hints:**
* Convert the `host_since` column to a `datetime` object
* Create the `days_as_host` column using the logic below (Note: this logic subtracts the host since date from the current date and then we pull the days from that calculation)
* Drop `host_since`

In [None]:
# convert host_since to datetime object

df[''] = (pd.to_datetime('today').normalize() - 'replace_this_val_with_host_since_date').dt.days # create new column, days_as_host by subtracting host_since from today's date

# drop host_since

**15 - Create a new column called `bathrooms`**
* Begin by examining the `value_counts` of `bathrooms_text`
* Replace any numbers written in word form with the corresponding number (e.g., zero baths --> 0 baths)
* Create a new column called `bathrooms` by splitting the text from bathrooms_text on a space delimiter and extracting the first value
* Fill in missing values with 0 and drop `bathrooms_text`

In [None]:
# check value_counts of bathrooms_text

# string replace any numbers in word form with the corresponding number form
df['bathrooms_text'] = df['bathrooms_text'].str.replace('', '')

# create new bathrooms column by extracting first value from 'bathrooms_text'
df[''] =  df[''].str.split(' ', n=1, expand=True)[0].astype(float)

# fill in missing values with 0
df['bathrooms'] =  df['bathrooms'].fillna('insert value')

# drop 'bathrooms_text' from df
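
A minimal sketch of the same sequence on made-up values is shown below. The strings in `baths` are illustrative and may not match the spellings in the real `bathrooms_text` column, so check `value_counts` first and adjust the replacement accordingly.

```python
import pandas as pd

# Made-up values; the real column may contain other spellings.
baths = pd.DataFrame(
    {"bathrooms_text": ["1 bath", "2.5 baths", "zero baths", None]}
)

# Inspect the distinct values before changing anything.
print(baths["bathrooms_text"].value_counts(dropna=False))

# Replace the number written as a word with its digit form.
baths["bathrooms_text"] = baths["bathrooms_text"].str.replace(
    "zero", "0", regex=False
)

# Split on the first space and keep the leading number as a float.
baths["bathrooms"] = (
    baths["bathrooms_text"].str.split(" ", n=1, expand=True)[0].astype(float)
)

# Fill missing values with 0, as this step instructs, then drop the text column.
baths["bathrooms"] = baths["bathrooms"].fillna(0)
baths = baths.drop(columns="bathrooms_text")

print(baths)
```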

**16 - Create a column called `short_term`. This column will be 1 if the `minimum_nights` column is less than 30, and 0 otherwise.**

In [None]:
# replace with code to create short_term column
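
One way to build such a flag is to compare the column to 30 and cast the resulting booleans to integers, sketched here on a made-up `stays` dataframe.

```python
import pandas as pd

# Made-up values chosen to show both sides of the 30-night threshold.
stays = pd.DataFrame({"minimum_nights": [1, 29, 30, 90]})

# The comparison yields True/False; astype(int) turns that into 1/0.
stays["short_term"] = (stays["minimum_nights"] < 30).astype(int)

print(stays)
```

A listing with exactly 30 minimum nights gets 0, because the condition is strictly less than 30.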

**17 - Any columns containing 't' and 'f' as values (true, false) should be converted to 1 and 0.**

In [None]:
# replace with code to replace t,f values with 1,0
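
The sketch below shows one possible approach on a made-up `flags` dataframe: find the columns whose non-missing values are all 't' or 'f', then map them to 1 and 0. The detection step is a convenience assumed for this example; listing the column names by hand works just as well.

```python
import pandas as pd

# Made-up rows; room_type is included to show it is left untouched.
flags = pd.DataFrame(
    {
        "host_is_superhost": ["t", "f", None],
        "host_identity_verified": ["t", "t", "f"],
        "room_type": ["Private room", "Entire home/apt", "Private room"],
    }
)

# Columns whose non-missing values are all 't' or 'f'.
tf_cols = [
    col for col in flags.columns
    if flags[col].dropna().isin(["t", "f"]).all()
]

# Map 't' to 1 and 'f' to 0; missing values remain missing.
for col in tf_cols:
    flags[col] = flags[col].map({"t": 1, "f": 0})

print(tf_cols)
print(flags)
```

Because missing values are left as they are, a column that contains any becomes a float column after mapping.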

**18 - We want to treat `price` as a float, but it's currently an object. Remove any text characters, then convert to float.**

In [None]:
# replace with code to remove text characters from price, then convert to float
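
As an illustration, the sketch below strips a dollar sign and thousands separators from made-up price strings before converting. It assumes those are the only non-numeric characters; inspect the real column to confirm before relying on this pattern.

```python
import pandas as pd

# Made-up price strings, including one missing value.
prices = pd.DataFrame({"price": ["$120.00", "$1,250.50", None]})

# Remove '$' and ',' with a regular expression, then convert to float.
prices["price"] = (
    prices["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

print(prices)
print(prices["price"].dtype)
```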

**19 - You should drop variables that are not relevant to the analysis in this step (i.e., do we need the lat and long of properties or is that unnecessary info?). We will examine missing data and outliers in the Unit 2 Assessment, so don't worry about `int` or `float` columns for now (unless they should be dropped).**

In [None]:
# remove columns based on your assumptions identified and perform other data cleaning steps as necessary.
# you might consider using a different code cell for each variable/column you make any changes to

**20 - Display the finalized clean dataset**

In [None]:
# replace with code to display finalized clean dataset