## DAB200 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective.

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended so be prepared to think on your own, in a logical way, in order to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5;
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts.

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/Chhavinder058/Gradedlab1/main/rent_12.csv")

In [None]:
df.head(5)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,longitude,manager_id,photos,price,street_address,interest_level,num_desc_words
0,1.0,1,6257ec70258e72c2f9f32cb92f1d3449,6/16/2016 7:49,"Amazing modern rental, Central air and central...",North 12th Street,"['Elevator', 'Laundry in Building', 'New Const...",40.7199,-73.9538,5ba989232d0489da1b5f2c45f6688adc,['https://photos.renthop.com/2/7172218_c8961ee...,2850,210 North 12th Street,-3,51
1,1.0,2,e3c4f2223d1deb777fc7941dbb41047c,4/26/2016 2:38,"Spacious and bright full-floor two bedroom, 1....",East 55th Street,"['Laundry in Unit', 'Dishwasher', 'Hardwood Fl...",40.7593,-73.9689,76a7b8c8e01b7192330128f82a3445fb,['https://photos.renthop.com/2/6924750_ea06008...,4000,157 East 55th Street,1,182
2,2.0,2,4a4da2191c7545ceca3315b5a01c9f4a,5/11/2016 2:53,"Apartment Description:Enjoy beautiful, open ci...",One Columbus Place,"['Fitness Center', 'Residents Lounge', 'Childr...",40.7691,-73.9857,9df32cb8dda19d3222d66e69e258616b,['https://photos.renthop.com/2/6993774_a1c45e8...,5725,One Columbus Place,1,162
3,1.0,0,f2aac61e5e6f63cd9f2807f598c17f18,6/7/2016 3:50,,East 93rd Street,"['Cats Allowed', 'Dogs Allowed']",40.7833,-73.9511,e6472c7237327dd3903b3d6f6a94515a,['https://photos.renthop.com/2/7119477_d583830...,1995,188 East 93rd Street,1,0
4,1.0,1,766fee7eb49a9a822dae64dafb5d8320,6/9/2016 5:59,"8TH ST/ MERCER, PRIME GREENWICH VILLAGE, LARGE...",Mercer St,"['Swimming Pool', 'Roof Deck', 'Doorman', 'Ele...",40.7301,-73.9935,2a1ee03b449700c3a15dd8c9a505c525,['https://photos.renthop.com/2/7132122_e7f10e9...,3500,300 Mercer St,3,145


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bathrooms        20000 non-null  float64
 1   bedrooms         20000 non-null  int64  
 2   building_id      20000 non-null  object 
 3   created          20000 non-null  object 
 4   description      19403 non-null  object 
 5   display_address  19944 non-null  object 
 6   features         20000 non-null  object 
 7   latitude         20000 non-null  float64
 8   longitude        20000 non-null  float64
 9   manager_id       20000 non-null  object 
 10  photos           20000 non-null  object 
 11  price            20000 non-null  int64  
 12  street_address   19996 non-null  object 
 13  interest_level   20000 non-null  int64  
 14  num_desc_words   20000 non-null  int64  
dtypes: float64(3), int64(4), object(8)
memory usage: 2.3+ MB


In [None]:
df1 = df[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
df1.head(5)

   bathrooms  bedrooms  longitude  latitude  price
0        1.0         1   -73.9538   40.7199   2850
1        1.0         2   -73.9689   40.7593   4000
2        2.0         2   -73.9857   40.7691   5725
3        1.0         0   -73.9511   40.7833   1995
4        1.0         1   -73.9935   40.7301   3500


In [None]:
print(df1.isnull().any())

bathrooms    False
bedrooms     False
longitude    False
latitude     False
price        False
dtype: bool


In [None]:
df1.shape

(20000, 5)

In [None]:
X_train = df1.drop('price', axis=1)
y_train = df1['price']

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100,n_jobs=-1)

In [None]:
print(X_train.head())

   bathrooms  bedrooms  longitude  latitude
0        1.0         1   -73.9538   40.7199
1        1.0         2   -73.9689   40.7593
2        2.0         2   -73.9857   40.7691
3        1.0         0   -73.9511   40.7833
4        1.0         1   -73.9935   40.7301


In [None]:
rf.fit(X_train, y_train)

In [None]:
# R^2 measured on the same rows the model was trained on, so it is optimistic
r2 = rf.score(X_train, y_train)
print(f"{r2:.4f}")

0.9449


In [None]:
rf = RandomForestRegressor(n_estimators=100,n_jobs=-1,oob_score=True)   # get error estimate
rf.fit(X_train, y_train)
noisy_oob_r2 = rf.oob_score_
print(f"OOB score {noisy_oob_r2:.4f}")

OOB score 0.4428


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = df1.drop('price', axis=1), df1['price']

errors = []   # validation MAE of each run
oobs = []     # OOB R^2 of each run
print("Validation MAE trials:", end='')
for i in range(10):
    # draw a new random 80/20 split on every run
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X_train, y_train)
    y_predicted = rf.predict(X_test)
    e = mean_absolute_error(y_test, y_predicted)
    o = rf.oob_score_   # estimated from the training rows each tree did not see
    print(f" ${e:.0f}", end='')
    errors.append(e)
    oobs.append(o)
print()
noisy_avg_mae = np.mean(errors)
oob_score_avg = np.mean(oobs)
print(f"Average validation MAE ${noisy_avg_mae:.0f}")
print(f"Average oob score {oob_score_avg:.4f}")

Validation MAE trials: $791 $841 $894 $935 $1164 $788 $835 $940 $853 $1015
Average validation MAE $906
Average oob score 0.3509
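
Because the OOB score is estimated from rows that each tree did not see during training, it can also be averaged over several fits without holding out a test set. The sketch below is one way this could look; `average_oob_score` is a hypothetical helper name, and the data comes from scikit-learn's `make_regression` purely for illustration (it is not the rent data).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def average_oob_score(X, y, n_runs=10, n_estimators=100):
    """Fit a fresh forest n_runs times on all rows and average the OOB R^2."""
    scores = []
    for _ in range(n_runs):
        rf = RandomForestRegressor(n_estimators=n_estimators, oob_score=True)
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return np.mean(scores)

# Synthetic stand-in data (an assumption for this sketch, not rent_12.csv).
X_demo, y_demo = make_regression(n_samples=500, n_features=4,
                                 noise=20.0, random_state=0)
avg = average_oob_score(X_demo, y_demo, n_runs=3, n_estimators=50)
print(f"Average OOB score {avg:.4f}")
```

With the lab data, the same helper could be called with `n_runs=10` on `X` and `y` to match the averaging requested in the instructions.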


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective.

In the first step, we imported the data using pd.read_csv(), then proceeded to examine it. Given our intention to utilize the Random Forest Regression model, we selected numeric variables and checked for any missing values within them. It's crucial to address null values as they can pose challenges when implementing machine learning algorithms.

- Initially, a basic random forest regression model is fitted to the entire dataset, where X holds the independent variables and y, the 'price' column, is the dependent variable. The $R^2$ of 0.9449 is measured on the same rows the model was trained on, so it is optimistic; the out-of-bag $R^2$ of 0.4428 from the same setup is a more realistic estimate of how well the model generalizes.

- Next, we computed the average price for the apartments across the entire dataset. Using this average, we calculated the Mean Absolute Error (MAE) relative to the average rent. This allows us to later compare the performance of our model against this baseline.

- Following that, we divided the dataset into training and testing samples using a 4:1 ratio. This step is essential for evaluating the accuracy of the model.

- The RandomForestRegressor is created with 100 decision trees and uses all available cores (n_jobs=-1). Each tree is trained on a bootstrap sample of the training rows, which introduces randomness and helps the model generalize.

- The model is trained, and predictions are then made for the test rows and compared with y_test to compute the Mean Absolute Error (MAE). The out-of-bag (OOB) score of the model is also recorded.

- This procedure is repeated 10 times, and the MAE and OOB score are averaged over the 10 runs. The averages indicate how well the model performs on the noisy dataset.

- The average OOB score (0.3509) and average MAE (\$906) indicate poor performance, suggesting that the dataset needs to be denoised.
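
The mean-price baseline described above does not appear in the submitted code cells. A minimal sketch of how such a baseline could be computed is given below; `baseline_mae` is a hypothetical helper name, and the `prices` values are made-up numbers for illustration, not rows from the lab data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def baseline_mae(y):
    """MAE obtained by predicting the mean of y for every row."""
    y = np.asarray(y, dtype=float)
    mean_prediction = np.full_like(y, y.mean())
    return mean_absolute_error(y, mean_prediction)

# Illustrative rents only (assumed values, not taken from rent_12.csv).
prices = [2000, 3000, 4000, 7000]
print(f"Baseline MAE ${baseline_mae(prices):.0f}")   # Baseline MAE $1500
```

A model is only useful if its validation MAE is clearly lower than this baseline, since the baseline ignores every feature.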

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data.

#### Code (25 marks)

In [None]:
# Keep only the rows with a price strictly between $1,000 and $10,000.
# This also removes the negative prices listed in the dataset.
df_clean = df1[(df1.price > 1_000) & (df1.price < 10_000)]

In [None]:
# Drop the rows where both latitude and longitude are 0
df_clean = df_clean[(df_clean.longitude != 0) | (df_clean.latitude != 0)]

In [None]:
# Keep only the rows whose latitude and longitude fall inside a bounding box around New York City

df_clean = df_clean[(df_clean['latitude']>40.55) &
                    (df_clean['latitude']<40.94) &
                    (df_clean['longitude']>-74.1) &
                    (df_clean['longitude']<-73.67)]

In [None]:
# Exclude the rows where the number of bathrooms is zero
df_clean = df_clean[(df_clean['bathrooms']!=0)]
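
The filters above could also be collected into a single function so that the same cleaning is applied consistently wherever it is needed. The sketch below assumes the same thresholds as the cells above; `denoise` is a hypothetical name, and the toy rows are invented for illustration.

```python
import pandas as pd

def denoise(df):
    """Apply the Part 2 filters in one pass and return a copy."""
    keep = (
        (df["price"] > 1_000) & (df["price"] < 10_000)          # typical rents only
        & (df["latitude"] > 40.55) & (df["latitude"] < 40.94)    # NYC bounding box
        & (df["longitude"] > -74.1) & (df["longitude"] < -73.67)
        & (df["bathrooms"] != 0)                                 # drop zero-bathroom rows
    )
    return df[keep].copy()

# Invented rows: one valid listing followed by one example of each issue.
toy = pd.DataFrame({
    "bathrooms": [1.0, 0.0, 1.0, 2.0, 1.0],
    "bedrooms":  [1, 1, 2, 3, 0],
    "longitude": [-73.95, -73.97, 0.0, -73.99, -73.98],
    "latitude":  [40.72, 40.76, 0.0, 40.77, 40.75],
    "price":     [2850, 3000, 3200, 45000, -100],
})
clean = denoise(toy)
print(f"kept {len(clean)} of {len(toy)} rows")   # kept 1 of 5 rows
```

The separate check for coordinates of zero is not repeated here because a point at latitude 0, longitude 0 already falls outside the bounding box.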

### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [None]:
X = df_clean.drop('price', axis=1)
y = df_clean['price']
errors = []      # validation MAE of each run
oob_score = []   # OOB R^2 of each run
print("Validation MAE trials:", end='')
for i in range(10):
    # draw a new random 80/20 split on every run
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X_train, y_train)
    y_predicted = rf.predict(X_test)
    e = mean_absolute_error(y_test, y_predicted)
    oob = rf.oob_score_
    print(f" ${e:.0f}", end=' \n')
    print(f" {oob:.4f}", end='\n')
    errors.append(e)
    oob_score.append(oob)
print()
clean_avg_mae = np.mean(errors)   # average over the runs on the denoised data
avg_oob = np.mean(oob_score)
print(f"Average validation MAE ${clean_avg_mae:.0f}")
print(f"Oob score avg {avg_oob:.4f}")

Validation MAE trials: $338 
 0.8214
 $339 
 0.8183
 $359 
 0.8225
 $354 
 0.8187
 $344 
 0.8198
 $353 
 0.8204
 $353 
 0.8220
 $347 
 0.8203
 $350 
 0.8130
 $356 
 0.8235

Average validation MAE $349
Oob score avg 0.8200


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective.

#### Random Forest Regressor

We opted for the Random Forest Regressor from the scikit-learn library because we're dealing with predicting rent prices, which is a continuous variable, making it a regression problem.

##### Explanation and Justification of code

- The denoised data is held in df_clean. The independent variables are obtained by dropping the 'price' column with df_clean.drop('price', axis=1) and stored in X; the dependent variable, the 'price' column, is stored in y. Separating the features from the target is a required step for any supervised machine learning model, including random forest regression.

- Two empty lists were created to store the Mean Absolute Error and OOB score obtained during different iterations of the model run.

- A for loop is utilized to execute the model 10 times, as per the instructions.

- The data is divided into training and testing sets in an 80:20 ratio, represented by X_train, y_train, X_test, and y_test. This split is needed to assess the model's accuracy on data it was not trained on. The model is trained on 80% of the data (X_train and y_train), while the remaining 20% (X_test and y_test) is used for accuracy testing.

- The RandomForestRegressor model is instantiated and stored in a variable named rf. The number of decision trees is set to 100 with n_estimators, and n_jobs=-1 uses all CPU cores to parallelize and speed up training. oob_score is set to True so that the out-of-bag (OOB) score is computed for each run. Bootstrap sampling itself is controlled by the separate bootstrap parameter, which defaults to True in scikit-learn; leaving oob_score at False would still train each tree on a bootstrap sample but would not compute the OOB score.

- rf.fit(X_train, y_train) trains the RandomForestRegressor model using the training data X_train and the corresponding target values y_train.

- rf.predict(X_test) generates predictions for the dependent variable, which are stored in y_predicted. These predictions are then compared with y_test to calculate the Mean Absolute Error (MAE). The fitted model's rf.oob_score_ attribute holds the OOB score. For each iteration, the MAE and OOB score are stored in the variables 'e' and 'oob', respectively, and appended to the two lists. These values are then averaged, and the averages are displayed as the final output.

- The MAE of \$349 and OOB score of 0.820 are a clear improvement over the model trained on the raw dataset, which averaged an MAE of \$906 and an OOB score of 0.3509. This suggests that denoising the data has resulted in a more accurate model.
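
In scikit-learn, whether trees are trained on bootstrap samples is set by the `bootstrap` parameter, while `oob_score` only requests that the out-of-bag estimate be computed. The sketch below illustrates the relationship between the two on synthetic data from `make_regression` (an assumption for illustration, not the rent data); the `_demo` variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=4,
                                 noise=10.0, random_state=0)

# Default settings: bootstrap sampling is on even though oob_score is off.
rf_default = RandomForestRegressor(n_estimators=50, random_state=0)
print(rf_default.bootstrap, rf_default.oob_score)

# oob_score=True additionally stores the OOB R^2 after fitting.
rf_oob = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
rf_oob.fit(X_demo, y_demo)
print(f"OOB R^2: {rf_oob.oob_score_:.4f}")

# OOB scoring needs bootstrap samples, so this combination is rejected at fit time.
try:
    RandomForestRegressor(n_estimators=50, bootstrap=False,
                          oob_score=True).fit(X_demo, y_demo)
    rejected = False
except ValueError:
    rejected = True
print(f"bootstrap=False with oob_score=True rejected: {rejected}")
```

Training every tree on the full dataset would therefore require setting `bootstrap=False` explicitly, and in that case no OOB score is available.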

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them.







| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- |
|  Negative price values  | A rent cannot be negative, so these values are errors in the data. Left in place, they distort statistics such as the mean price and pull the model's predictions downward, which makes the predictions for real apartments less accurate. | Removed by the price filter, which keeps only the rows with a price above \\$1000  | Rent can't be negative, so these rows carry no valid information about apartment prices   |
|  Price less than \\$1000  | These values influence the statistics of the dataset: the distribution becomes skewed and the mean no longer represents the actual average price of the apartments. The machine learning model is also pulled toward these low rents, which leads to poor predictions for typical apartments.    | Removed by the price filter, which keeps only the rows with a price above \\$1000  | A rent below \\$1000 is not realistic for a typical New York City apartment, which is what the objective targets   |
|  Price more than \\$10000  | These values influence the dataset's statistics, causing skewness and inflating the mean, so it misrepresents the actual average apartment price. The machine learning model may also be biased towards higher rents, resulting in poor prediction accuracy. | Removed by the price filter, which keeps only the rows with a price below \\$10000  | A rent above \\$10000 is not typical for New York City apartments, and the objective is to predict rents for typical apartments |
|  Latitude and longitude values of zero  | These rows wrongly associate spatial coordinates with prices, leading to incorrect conclusions about how rent varies with location    | Filter out the rows where latitude and longitude are zero  | Latitude 0 and longitude 0 is a point on the equator in the Atlantic Ocean off the coast of Africa, which is nowhere near New York City   |
| Latitude and longitude values that fall outside the city's coordinates  |   These rows inaccurately associate spatial coordinates with prices, resulting in incorrect conclusions about rent price trends.  | Filter out the rows whose latitude and longitude fall outside the city | These values are either errors made while collecting or preparing the dataset or listings that are not in the city. Depending on the coordinates, they may belong to nearby places just outside the city or to faraway places in other states. |
|  Number of bathrooms is zero  | Data points with no bathrooms distort summary statistics, such as the average number of bathrooms, leading to misleading insights about the rental properties. Zero values can also mislead the model, making it harder to learn accurate relationships between features and rent prices, and may skew the importance of the number of bathrooms as a feature.  | Filter out the rows with zero bathrooms  | A rental apartment with no bathroom is impractical, so a value of zero is most likely a data-entry error   |

