## DAB200 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended, so be prepared to think independently and logically to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means I want to see only the final code that fulfills the objective, not every piece of code you try as you work through this lab
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!
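
The 10-run averaging of the out-of-bag score can be sketched as follows. This is a minimal illustration, not a required implementation: the helper name `mean_oob_score` is hypothetical, and the data comes from `make_regression` as a synthetic stand-in for the rent data. The demo call uses fewer runs and trees than the lab asks for, only to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def mean_oob_score(X, y, n_runs=10, n_estimators=100):
    """Average the out-of-bag R^2 over several unseeded random forest fits."""
    scores = []
    for _ in range(n_runs):
        # No random_state: each run draws different bootstrap samples
        rf = RandomForestRegressor(n_estimators=n_estimators, oob_score=True)
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return float(np.mean(scores))


# Synthetic stand-in data (an assumption; not the rent data set)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# Fewer runs and trees than the lab requires, for a fast demonstration
demo_score = mean_oob_score(X_demo, y_demo, n_runs=3, n_estimators=50)
print(f"Mean OOB R^2 over 3 runs: {demo_score:.3f}")
```

Because each tree is scored on the rows left out of its bootstrap sample, this sketch fits on all of `X` and `y` without a separate test split.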

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 13
 - Group Members
     - Dharmik Patel (Student ID: 0813537)
     - Harshil Patel (Student ID: 0801869)
     - Deep Cha (Student ID: 0813502)

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [5]:
# Importing the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Loading the dataset
df = pd.read_csv('rent_13.csv')
df.head(3)

| | bathrooms | bedrooms | building_id | created | description | display_address | features | latitude | longitude | manager_id | photos | price | street_address | interest_level | num_desc_words | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | 1.0 | 3 | 80911ce8a425daf4989ea8a4bccc41a7 | 2016-04-14 02:28:38 | AMAZING 3 BEDROOMS FLEX LOCATED IN THE MOST CO... | 6 Ave. | ['Roof Deck', 'Doorman', 'Elevator', 'Laundry ... | 40.7514 | -73.9862 | c44ea3e83ff2048561ab5c21a2d0c90e | ['https://photos.renthop.com/2/6870393_69866e0... | 4400 | 990 6 Ave. | low | 100 | 6 |
| 1 | 1.0 | 2 | 7967a1280bf3f7644500fc79d2696b0e | 2016-05-07 02:56:33 | Midtown 2 bedroom with Elevator, laundry and f... | W 45th St | ['Cats Allowed', 'Dogs Allowed', 'Doorman', 'E... | 40.7601 | -73.99 | 8005d4c588f87fe6709c67918509adeb | ['https://photos.renthop.com/2/6978575_a898aed... | 3450 | 341 W 45th St | low | 31 | 2 |
| 2 | 1.0 | 1 | c94301249b8c09429d329864d58e5b82 | 2016-04-14 02:36:37 | BEAUTIFUL DOORMAN BUILDING IN FINANCIAL DISTRI... | Gold St. | ['Swimming Pool', 'Roof Deck', 'Balcony', 'Doo... | 40.7074 | -74.0069 | dc76cf7e3a02bbbe11b2cfb833544b89 | ['https://photos.renthop.com/2/6870551_43b0030... | 3490 | 2 Gold St. | low | 61 | 2 |


In [25]:
# Checking the data types of the columns
df.dtypes

# Extracting all the numerical columns
df_num = df.select_dtypes(include = np.number)

# remove the num_desc_words column
df_num = df_num.drop(['num_desc_words'], axis = 1)

# df_num is already a DataFrame (select_dtypes returns one); this call leaves its contents unchanged
df_num = pd.DataFrame(df_num)

# Checking numerical columns for missing values
df_num.isnull().sum()

bathrooms        0
bedrooms         0
latitude         0
longitude        0
price            0
mgr_apt_count    0
dtype: int64

The first model below uses a fixed random state so that every group member can reproduce the same results for the lab.
The model after it uses no random state: a for loop fits it 10 times and stores each run's scores in lists.
We then print the average OOB score over those 10 runs.
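
The difference between the seeded and unseeded models can be illustrated with a short sketch. It assumes synthetic data from `make_regression` rather than `rent_13.csv`, and the helper `fit_oob` is a hypothetical name introduced only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption; not the rent data set)
X_toy, y_toy = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=1)


def fit_oob(seed=None):
    """Fit a random forest and return its out-of-bag R^2."""
    rf = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=seed)
    rf.fit(X_toy, y_toy)
    return rf.oob_score_


# Same seed: the bootstrap samples, and therefore the score, repeat exactly
seeded_a = fit_oob(seed=42)
seeded_b = fit_oob(seed=42)

# No seed: every fit draws new bootstrap samples, so scores usually differ slightly
unseeded = [fit_oob() for _ in range(3)]

print("seeded:  ", seeded_a, seeded_b)
print("unseeded:", unseeded)
```

A fixed seed is convenient for sharing results, while the unseeded runs show how much the score moves from fit to fit, which is why the lab asks for an average.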


In [26]:
# Creating an initial model with all the numerical columns to predict the price
X1 = df_num.drop('price', axis = 1)
y1 = df_num['price']

# split the data into train and test
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.2, random_state = 42)

# Creating a random forest regressor model with oob_score=True, n_estimators=100, random_state=42, n_jobs=-1
regr1 = RandomForestRegressor(oob_score=True, n_estimators=100, random_state=42, n_jobs=-1)

# Fitting the model
regr1.fit(X1_train, y1_train)

# Predicting the price on the test data
y1_pred = regr1.predict(X1_test)

In [27]:
# print the r2_score for both train and test data with 3 decimal places
print('r2_score for train data: ', round(r2_score(y1_train, regr1.predict(X1_train)), 3))
print('r2_score for test data: ', round(r2_score(y1_test, y1_pred), 3))

# print the OOB score
print('OOB score: ', round(regr1.oob_score_, 3))

r2_score for train data:  0.89
r2_score for test data:  0.0
OOB score:  0.316


In [28]:
# Run 10 unseeded fits and store each run's train R-squared, test R-squared, and OOB score in lists
regr1_r2_train = []
regr1_r2_test = []
regr1_oob_score = []

# starting the for loop
for i in range(10):
    X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.2)
    regr1 = RandomForestRegressor(oob_score=True, n_estimators=100, n_jobs=-1)
    regr1.fit(X1_train, y1_train)
    y1_pred = regr1.predict(X1_test)
    regr1_r2_train.append(round(r2_score(y1_train, regr1.predict(X1_train)), 3))
    regr1_r2_test.append(round(r2_score(y1_test, y1_pred), 3))
    regr1_oob_score.append(round(regr1.oob_score_, 3))

In [29]:
# print the list of the r2_score for both train and test data with 3 decimal places
print('r2_score for train data: ', regr1_r2_train)
print(" ")
print('r2_score for test data: ', regr1_r2_test)
print(" ")
# print the list of the OOB scores
print('OOB score: ', regr1_oob_score)

r2_score for train data:  [0.841, 0.906, 0.897, 0.884, 0.848, 0.859, 0.858, 0.848, 0.852, 0.901]
 
r2_score for test data:  [-22.935, -0.397, 0.088, -0.02, -0.525, 0.0, 0.011, 0.006, -0.002, -0.859]
 
OOB score:  [-0.145, -0.049, -0.036, -0.02, -0.063, -0.114, -0.017, 0.0, -0.064, -0.017]


In [30]:
# Average the OOB score and R-squared over the 10 runs
print('Average OOB score: ', round(np.mean(regr1_oob_score), 3))
print("")
print('Average R-squared for train data: ', round(np.mean(regr1_r2_train), 3))
print("")
print('Average R-squared for test data: ', round(np.mean(regr1_r2_test), 3))


Average OOB score:  -0.053

Average R-squared for train data:  0.869

Average R-squared for test data:  -2.463
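
A note on the negative values above: R-squared is defined as 1 - SS_res / SS_tot, so it falls below zero whenever the predictions have a larger squared error than simply predicting the mean of the true values. The toy numbers below are made up purely to show this behaviour of `r2_score`; they are not taken from the rent data.

```python
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = [1.0, 2.0, 3.0]
y_reversed = [3.0, 2.0, 1.0]  # predictions that are worse than the mean
y_mean = [2.0, 2.0, 2.0]      # always predicting the mean of y_true

r2_reversed = r2_score(y_true, y_reversed)  # 1 - 8/2
r2_mean = r2_score(y_true, y_mean)          # 1 - 2/2

print("R^2 for reversed predictions:", r2_reversed)
print("R^2 for mean predictions:", r2_mean)
```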


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [31]:
pd.options.display.float_format = '{:20,.2f}'.format

In [32]:
# describe the dataframe
df_num.describe()

| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |
| count | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 | 20000.0 |
| mean | 1.21 | 1.54 | 39.41 | -67.62 | 3755.1 | 81.24 |
| std | 0.5 | 1.11 | 6.76 | 28.31 | 32725.22 | 240.64 |
| min | 0.0 | 0.0 | -1.14 | -118.27 | -90000.0 | -300.0 |
| 25% | 1.0 | 1.0 | 40.72 | -73.99 | 2425.0 | 7.0 |
| 50% | 1.0 | 1.0 | 40.75 | -73.98 | 3100.0 | 18.0 |
| 75% | 1.0 | 2.0 | 40.77 | -73.95 | 4025.5 | 50.0 |
| max | 10.0 | 8.0 | 44.88 | 77.13 | 4490000.0 | 1064.0 |


In [33]:
# creating a new data frame as a copy of df_num
df_new = df_num.copy()

In [35]:
# checking the price column anomalies such as price < 0 and price > 10000
df_new[(df_new['price'] < 0) | (df_new['price'] > 10000)]

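| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |
| 15 | 1.00 | 1 | 40.77 | -73.99 | -3175 | 8 |
| 36 | 2.00 | 2 | 40.74 | -73.99 | -9000 | 2 |
| 58 | 1.00 | 3 | 40.76 | -73.99 | -3995 | 1064 |
| 65 | 1.00 | 3 | 40.82 | -73.95 | -2500 | 54 |
| 84 | 1.00 | 3 | 40.82 | -73.96 | -3500 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 19895 | 3.00 | 6 | 40.74 | -73.99 | 11500 | 51 |
| 19918 | 1.00 | 1 | 40.73 | -74.00 | -2800 | 3 |
| 19921 | 2.00 | 3 | 40.79 | -73.95 | -2550 | 55 |
| 19946 | 1.00 | 1 | 40.71 | -74.01 | -2995 | 17 |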


In [38]:
# keeping only rows whose price is strictly between 1000 and 10000
df_new = df_new[(df_new['price'] > 1000) & (df_new['price'] < 10000)]

In [39]:
df_new.head(5)

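| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |
| 0 | 1.0 | 3 | 40.75 | -73.99 | 4400 | 6 |
| 1 | 1.0 | 2 | 40.76 | -73.99 | 3450 | 2 |
| 2 | 1.0 | 1 | 40.71 | -74.01 | 3490 | 2 |
| 3 | 1.0 | 1 | 40.77 | -73.95 | 3150 | 39 |
| 4 | 1.0 | 2 | 40.78 | -73.95 | 3100 | 10 |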


In [40]:
# confirming that no rows with price < 0 or price > 10000 remain
df_new[(df_new['price'] < 0) | (df_new['price'] > 10000)].count()

bathrooms        0
bedrooms         0
latitude         0
longitude        0
price            0
mgr_apt_count    0
dtype: int64

In [44]:
# Checking for 0 in the latitude and longitude columns
df_new[(df_new['latitude'] == 0) | (df_new['longitude'] == 0)]

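| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |
| 2221 | 1.0 | 2 | 0.0 | 0.0 | 3200 | 2 |
| 3819 | 1.0 | 1 | 0.0 | 0.0 | 3495 | 8 |
| 4177 | 5.0 | 6 | 0.0 | 0.0 | 9995 | 13 |
| 4986 | 1.0 | 2 | 0.0 | 0.0 | 3200 | 2 |
| 8100 | 1.0 | 1 | 0.0 | 0.0 | 1725 | 1064 |
| 19550 | 1.0 | 1 | 0.0 | 0.0 | 1750 | 1 |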


In [45]:
# keeping only rows with longitude strictly between -74.1 and -73.67 and latitude strictly between 40.55 and 40.94
df_new = df_new[(df_new['longitude'] > -74.1) & (df_new['longitude'] < -73.67) & 
    (df_new['latitude'] > 40.55) & (df_new['latitude'] < 40.94)]
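
The price and coordinate filters above could also be collected into a single function, sketched here. The function name `denoise_rent` and the toy rows are hypothetical and used only for illustration; the default ranges are the ones used in the cells above.

```python
import pandas as pd


def denoise_rent(df, price_range=(1000, 10000),
                 lat_range=(40.55, 40.94), lon_range=(-74.1, -73.67)):
    """Keep rows whose price and coordinates lie strictly inside each range."""
    mask = (
        (df["price"] > price_range[0]) & (df["price"] < price_range[1])
        & (df["latitude"] > lat_range[0]) & (df["latitude"] < lat_range[1])
        & (df["longitude"] > lon_range[0]) & (df["longitude"] < lon_range[1])
    )
    return df[mask].copy()


# Toy rows for illustration only: one valid listing, a negative price,
# zero coordinates, a price above the cap, and a price exactly on the boundary
toy = pd.DataFrame({
    "price": [4400, -3175, 3200, 11500, 1000],
    "latitude": [40.75, 40.77, 0.0, 40.74, 40.75],
    "longitude": [-73.99, -73.99, 0.0, -73.99, -73.98],
})

clean = denoise_rent(toy)
print(clean)
```

Only the first toy row survives. The row priced at exactly 1000 is dropped because the comparisons are strict, which matches the `>` and `<` operators used in the cells above.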

In [47]:
# describe the dataframe
df_new.describe()

| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |
| count | 18127.0 | 18127.0 | 18127.0 | 18127.0 | 18127.0 | 18127.0 |
| mean | 1.18 | 1.51 | 40.75 | -73.97 | 3432.03 | 82.16 |
| std | 0.43 | 1.09 | 0.04 | 0.03 | 1396.89 | 242.79 |
| min | 0.0 | 0.0 | 40.58 | -74.09 | 1034.0 | -300.0 |
| 25% | 1.0 | 1.0 | 40.73 | -73.99 | 2475.0 | 7.0 |
| 50% | 1.0 | 1.0 | 40.75 | -73.98 | 3100.0 | 18.0 |
| 75% | 1.0 | 2.0 | 40.77 | -73.95 | 4000.0 | 48.0 |
| max | 10.0 | 8.0 | 40.91 | -73.7 | 9999.0 | 1064.0 |


In [41]:
# Checking the bedrooms column anomalies such as bedrooms < 0 and bedrooms > 10
df_new[(df_new['bedrooms'] < 0) | (df_new['bedrooms'] > 10)]

| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |


In [43]:
# checking the bathrooms column anomalies such as bathrooms < 0 and bathrooms > 10
df_new[(df_new['bathrooms'] < 0) | (df_new['bathrooms'] > 10)]

| | bathrooms | bedrooms | latitude | longitude | price | mgr_apt_count |
| :- | :- | :- | :- | :- | :- | :- |



### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [49]:
# creating a new model with the denoised data frame df_new
X2 = df_new.drop('price', axis = 1)
y2 = df_new['price']

# split the data into train and test
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.2, random_state = 42)

# Creating a random forest regressor model with oob_score=True, n_estimators=100, random_state=42, n_jobs=-1
regr2 = RandomForestRegressor(oob_score=True, n_estimators=100, random_state=42, n_jobs=-1)

# Fitting the model
regr2.fit(X2_train, y2_train)

# Predicting the price on the test data
y2_pred = regr2.predict(X2_test)

In [50]:
# print the r2_score for both train and test data with 3 decimal places
print('r2_score for train data: ', round(r2_score(y2_train, regr2.predict(X2_train)), 3))
print('r2_score for test data: ', round(r2_score(y2_test, y2_pred), 3))

# print the OOB score to 3 decimal places
print('OOB score: ', round(regr2.oob_score_, 3))

r2_score for train data:  0.971
r2_score for test data:  0.81
OOB score:  0.802


In [51]:
# Run 10 unseeded fits and store each run's train R-squared, test R-squared, and OOB score in lists
regr2_r2_train = []
regr2_r2_test = []
regr2_oob_score = []

# starting the for loop
for i in range(10):
    X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.2)
    regr2 = RandomForestRegressor(oob_score=True, n_estimators=100, n_jobs=-1)
    regr2.fit(X2_train, y2_train)
    y2_pred = regr2.predict(X2_test)
    regr2_r2_train.append(round(r2_score(y2_train, regr2.predict(X2_train)), 3))
    regr2_r2_test.append(round(r2_score(y2_test, y2_pred), 3))
    regr2_oob_score.append(round(regr2.oob_score_, 3))


In [52]:
# print the list of the r2_score for both train and test data with 3 decimal places
print('r2_score for train data: ', regr2_r2_train)
print(" ")
print('r2_score for test data: ', regr2_r2_test)
print(" ")
# print the list of the OOB scores
print('OOB score: ', regr2_oob_score)

r2_score for train data:  [0.971, 0.97, 0.971, 0.971, 0.97, 0.971, 0.971, 0.971, 0.972, 0.972]
 
r2_score for test data:  [0.816, 0.823, 0.803, 0.802, 0.815, 0.788, 0.797, 0.796, 0.795, 0.8]
 
OOB score:  [0.803, 0.798, 0.806, 0.803, 0.799, 0.804, 0.806, 0.804, 0.809, 0.809]


In [53]:
# Average the OOB score and R-squared over the 10 runs
print('Average OOB score: ', round(np.mean(regr2_oob_score), 3))
print("")
print('Average R-squared for train data: ', round(np.mean(regr2_r2_train), 3))
print("")
print('Average R-squared for test data: ', round(np.mean(regr2_r2_test), 3))

Average OOB score:  0.804

Average R-squared for train data:  0.971

Average R-squared for test data:  0.804


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 
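
The heights example can be made concrete with a short sketch. The units and bounds are assumptions: the text does not state them, so heights are taken to be in feet with a hypothetical plausible range of 3 to 9. A single filter line removes all three bad values, yet each one is a separate issue for the table.

```python
heights = [-6, 5, 0, 50]

# Assumed plausible range for an adult height in feet (hypothetical bounds)
MIN_HEIGHT, MAX_HEIGHT = 3, 9


def describe_issue(h):
    """Name the kind of problem an out-of-range height represents."""
    if h < 0:
        return "negative height"
    if h == 0:
        return "zero height"
    if h < MIN_HEIGHT:
        return "implausibly small height"
    return "implausibly large height"


# One line of filtering addresses every issue at once
valid_heights = [h for h in heights if MIN_HEIGHT <= h <= MAX_HEIGHT]

# Each removed value is still documented as its own issue
issues = {h: describe_issue(h) for h in heights if not MIN_HEIGHT <= h <= MAX_HEIGHT}

print(valid_heights)
print(issues)
```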

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
|  example problem 1  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  example problem 2  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
