## COMP2006 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended, so be prepared to think independently and logically to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 
 - Group Members
     - Name (Student ID)
     - Name (Student ID)
     - Name (Student ID)

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [2]:
import pandas as pd

# look at the full dataset before selecting just the numeric columns
rent = pd.read_csv('rent_9.csv')
rent.head().T

| | 0 | 1 | 2 | 3 | 4 |
| :- | :- | :- | :- | :- | :- |
| bathrooms | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| bedrooms | 2 | 0 | 2 | 1 | 2 |
| building_id | 96835a48bbe776c37aec3dc0f3df8887 | 3a956bd42c50f06ac84cf072fc514f5f | 1a5bc67bc49344ad10e9422ced5a73e9 | 2140cd5e5734e0a9565759e593d18c87 | c5ac932395aabfd2e04f782fef984f06 |
| created | 2016-04-06 18:20:53 | 2016-04-10 02:35:03 | 2016-06-09 23:42:21 | 2016-04-09 01:42:37 | 2016-04-24 02:35:50 |
| description | | Luxury STUDIO in this Full Service Luxury Buil... | Call, text, or email Mike or Tony to set up a ... | | 2 BEDROOM 2 BATH UNIT WITH A DINING ROOM AREA!... |
| display_address | West Street | W 42nd St. | E 83rd St and 1st Ave | West 102nd Street | East 80th Street |
| features | ['Doorman', 'Cats Allowed', 'Dogs Allowed'] | ['Swimming Pool', 'Roof Deck', 'Doorman', 'Ele... | [] | ['Prewar', 'Elevator', 'Laundry Room'] | ['Dining Room', 'Doorman', 'Elevator', 'Pre-Wa... |
| latitude | 40.7064 | 40.7613 | 40.7755 | 40.7984 | 40.7762 |
| longitude | -74.0155 | -73.9998 | -73.9516 | -73.9686 | -73.9592 |
| manager_id | e7023646cc4116c721919836cf77a298 | 9dabdb9265d1817435c38f5488d90141 | 256ef52932175829b363b4df8f6b81eb | 62b685cc0d876c3a1a51d63a0d6a8082 | 9d32b720e26a351b951c8f78f72f2fec |


In [4]:
# get a quick look at missing value amounts and data types
rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bathrooms        20000 non-null  float64
 1   bedrooms         20000 non-null  int64  
 2   building_id      20000 non-null  object 
 3   created          20000 non-null  object 
 4   description      19417 non-null  object 
 5   display_address  19945 non-null  object 
 6   features         20000 non-null  object 
 7   latitude         20000 non-null  float64
 8   longitude        20000 non-null  float64
 9   manager_id       20000 non-null  object 
 10  photos           20000 non-null  object 
 11  price            20000 non-null  int64  
 12  street_address   19997 non-null  object 
 13  interest_level   20000 non-null  object 
 14  num_desc_words   20000 non-null  int64  
dtypes: float64(3), int64(3), object(9)
memory usage: 2.3+ MB


In [5]:
# shortcut to select numeric columns
rent_num = rent.select_dtypes(include='number')
rent_num.head().T

| | 0 | 1 | 2 | 3 | 4 |
| :- | :- | :- | :- | :- | :- |
| bathrooms | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| bedrooms | 2.0 | 0.0 | 2.0 | 1.0 | 2.0 |
| latitude | 40.7064 | 40.7613 | 40.7755 | 40.7984 | 40.7762 |
| longitude | -74.0155 | -73.9998 | -73.9516 | -73.9686 | -73.9592 |
| price | 6495.0 | 2490.0 | 3400.0 | 3800.0 | 8500.0 |
| num_desc_words | 0.0 | 148.0 | 45.0 | 0.0 | 67.0 |


In [7]:
# check if there are any missing values
rent_num.isnull().any()

bathrooms         False
bedrooms          False
latitude          False
longitude         False
price             False
num_desc_words    False
dtype: bool

In [10]:
from sklearn.ensemble import RandomForestRegressor

X = rent_num.drop('price', axis=1)
y = rent_num['price']

oob_scores = []
for i in range(10):
    rf = RandomForestRegressor(oob_score=True)
    rf.fit(X, y)
    oob_scores.append(rf.oob_score_)

print("Mean OOB score:", sum(oob_scores) / len(oob_scores))


Mean OOB score: -0.38892975182726613


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

We followed Sections 5.1 and 5.2 of the textbook (see link in Notes section above) to obtain a quick baseline for the objective of predicting rent prices as well as possible for typical New York City apartments. 

This baseline metric serves as a starting point to evaluate the performance of our model. By using a baseline metric, we can assess the effectiveness of our denoising efforts and determine if they provide better predictions than the baseline. This allows us to measure the improvement achieved by simply cleaning up our data.
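
For a `RandomForestRegressor`, scikit-learn's `oob_score_` is an R² value computed on the out-of-bag predictions, so a negative baseline means the model does worse than always predicting the mean price. The sketch below computes R² by hand on a few made-up prices (the numbers and the `r_squared` helper are illustrative, not taken from the data set) to show how a single wildly wrong prediction can push the score below zero.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up rents and two sets of made-up predictions.
prices = [2000, 3000, 4000]
close_predictions = [2100, 2900, 4100]
one_wild_prediction = [2100, 2900, 9000]

r2_close = r_squared(prices, close_predictions)
r2_wild = r_squared(prices, one_wild_prediction)
print(r2_close, r2_wild)  # the second value is negative
```

A score of 1 is a perfect fit and 0 matches a model that always predicts the mean, so any negative value signals that something in the data is badly distorting the predictions.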

>Note: We only needed to use the oob_score.

We calculate the oob_score 10 times to get a more stable estimate. Every time we create a random forest, the bootstrap sampling changes, so each run evaluates on different oob samples. A single run might happen to evaluate on extreme apartments and give a score that is too high or too low. To address this possibility, we repeat the process 10 times and take the average; the more runs we average, the less our oob_score is affected by these outliers. 
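
As a minimal sketch of this idea, the helper below repeats a scoring function several times and reports the spread alongside the mean, which shows how much a single run could have misled us. The names `summarize_runs` and `fake_oob_score` are hypothetical, and the stand-in scorer just returns a made-up value with a little random noise; in the lab it would fit a new forest and return `rf.oob_score_`.

```python
import random
import statistics

def summarize_runs(score_once, n_runs=10):
    """Call score_once n_runs times; return (mean, sample standard deviation)."""
    scores = [score_once() for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical stand-in for "fit a forest and return rf.oob_score_":
# a made-up value of 0.5 plus a little run-to-run noise.
rng = random.Random(0)

def fake_oob_score():
    return 0.5 + rng.uniform(-0.01, 0.01)

mean_score, spread = summarize_runs(fake_oob_score)
print(f"mean={mean_score:.3f}, std={spread:.3f}")
```

A small standard deviation relative to the mean suggests that 10 runs are enough; a large one would be a reason to average over more runs.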

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [12]:
rent_num.describe()

| | bathrooms | bedrooms | latitude | longitude | price | num_desc_words |
| :- | :- | :- | :- | :- | :- | :- |
| count | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 |
| mean | 1.2136 | 1.5459 | 39.66387 | -64.604461 | 3620.118 | -24464.10095 |
| std | 0.506693 | 1.115148 | 4.791731 | 41.332045 | 11441.24 | 154766.651412 |
| min | 0.0 | 0.0 | 0.0 | -93.2705 | -31500.0 | -1000000.0 |
| 25% | 1.0 | 1.0 | 40.7236 | -73.9907 | 2450.0 | 46.0 |
| 50% | 1.0 | 1.0 | 40.7486 | -73.9765 | 3100.0 | 80.0 |
| 75% | 1.0 | 2.0 | 40.7728 | -73.9524 | 4066.25 | 117.0 |
| max | 7.0 | 8.0 | 44.8835 | 130.8419 | 1150000.0 | 663.0 |


### Issues

1. bathrooms = 0 (no bedroom is a studio apartment but no bathroom might mean a shared space, not an apartment)
2. bathrooms = 7 (what constitutes a typical apartment)
3. bedrooms = 8 (what constitutes a typical apartment)
4. latitude = 0 (that is the equator!)
5. latitude = 44.883500 (middle 50% of values very close to 40.7...)
6. longitude = -93.2705 (middle 50% of values very close to -73.9...)
7. longitude = 130.8419 (middle 50% of values very close to -73.9...)
8. negative prices
9. price = $1,150,000 (what constitutes a typical apartment)
10. num_desc_words = -1000000.0 (doesn't seem right for counting and when compared to the other values in this column)

> Now explore these one at a time
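
One way to explore an issue is to build a boolean mask for it and count how many rows it flags. The sketch below does this on a tiny made-up sample so it runs on its own; in the lab the same masks would be applied to `rent_num`, and the three checks shown are only illustrative.

```python
import pandas as pd

# Tiny made-up sample; the real lab would use rent_num instead.
sample = pd.DataFrame({
    "bathrooms": [1.0, 0.0, 7.0, 1.0],
    "latitude": [40.75, 0.0, 40.72, 44.88],
    "price": [3000, -31500, 1150000, 2500],
})

# One boolean mask per suspected issue.
checks = {
    "no bathroom": sample["bathrooms"] == 0,
    "latitude of 0": sample["latitude"] == 0,
    "negative price": sample["price"] < 0,
}

# Summing a boolean Series counts its True values.
counts = {name: int(mask.sum()) for name, mask in checks.items()}
print(counts)
```

Knowing how many rows each issue affects helps justify whether dropping them is reasonable for a model of typical apartments.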

In [32]:
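# Apply fixes

# keep apartments with at least one bathroom and fewer than 4 (issues 1 and 2)
rent_clean = rent_num[(rent_num['bathrooms'] > 0) & (rent_num['bathrooms'] < 4)]

# keep apartments with fewer than 4 bedrooms (issue 3)
rent_clean = rent_clean[(rent_clean['bedrooms'] < 4)]

# keep prices strictly between $1,000 and $10,000 (issues 8 and 9)
rent_clean = rent_clean[(rent_clean['price']>1000) & (rent_clean['price']<10000)]

# keep listings inside a bounding box around New York City (issues 4 to 7)
rent_clean = rent_clean[(rent_clean['latitude']>40.55) &
                    (rent_clean['latitude']<40.94) &
                    (rent_clean['longitude']>-74.1) &
                    (rent_clean['longitude']<-73.67)]

# keep positive word counts (issue 10); note this also drops listings with an empty description (0 words)
rent_clean = rent_clean[rent_clean['num_desc_words']>0]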

rent_clean.describe()

| | bathrooms | bedrooms | latitude | longitude | price | num_desc_words |
| :- | :- | :- | :- | :- | :- | :- |
| count | 15731.0 | 15731.0 | 15731.0 | 15731.0 | 15731.0 | 15731.0 |
| mean | 1.153519 | 1.41663 | 40.750756 | -73.972461 | 3330.455534 | 93.309643 |
| std | 0.372809 | 0.971475 | 0.038718 | 0.029421 | 1293.089161 | 55.447003 |
| min | 1.0 | 0.0 | 40.5757 | -74.094 | 1050.0 | 1.0 |
| 25% | 1.0 | 1.0 | 40.72795 | -73.9918 | 2450.0 | 54.0 |
| 50% | 1.0 | 1.0 | 40.7518 | -73.9782 | 3000.0 | 85.0 |
| 75% | 1.0 | 2.0 | 40.7739 | -73.9547 | 3895.0 | 121.0 |
| max | 3.5 | 3.0 | 40.9144 | -73.7001 | 9995.0 | 497.0 |
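
A quick sanity check after filtering is to see how much data survived. The arithmetic below uses the two row counts reported above (20000 before, 15731 after); whether the resulting share is acceptable is a judgment call that depends on what counts as a typical apartment.

```python
rows_before = 20000  # row count from rent.info()
rows_after = 15731   # row count from rent_clean.describe()

removed = rows_before - rows_after
kept_share = rows_after / rows_before
print(f"removed {removed} rows, kept {kept_share:.1%}")
```

If a single filter turns out to account for most of the removed rows, that filter deserves the closest justification in Part 4.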


### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [33]:
from sklearn.ensemble import RandomForestRegressor

X = rent_clean.drop('price', axis=1)
y = rent_clean['price']

oob_scores = []
for i in range(10):
    rf = RandomForestRegressor(oob_score=True)
    rf.fit(X, y)
    oob_scores.append(rf.oob_score_)

print("Mean OOB score:", sum(oob_scores) / len(oob_scores))

Mean OOB score: 0.7925646945939177


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

We have now removed or cleaned up all the identified issues in the data and would like to see whether this has any impact on the model's performance. Thus, we repeat exactly what we did in Part 1, just with the cleaned-up data.

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 
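
To make the heights example concrete, the sketch below shows a single range filter that handles all three bad values at once, even though each one is a separate data issue. The unit (feet) and the bounds are assumptions made for illustration; the text above does not state them.

```python
heights = [-6, 5, 0, 50]

# Assumed plausible range for adult height in feet (illustrative bounds).
MIN_HEIGHT, MAX_HEIGHT = 3, 9

# One filter line keeps only plausible values...
valid = [h for h in heights if MIN_HEIGHT <= h <= MAX_HEIGHT]

# ...but the values it rejects are three distinct issues to document.
issues = [h for h in heights if not (MIN_HEIGHT <= h <= MAX_HEIGHT)]
print(valid, issues)
```

Each rejected value has its own story (a negative height, a zero height, an impossibly large height), which is why each gets its own row in the table.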

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
|  example problem 1  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  example problem 2  | example explanation    | example fix  | example explanation about why this fix is appropriate   |


> Issues 1 through 10 should appear in the table