## COMP2006 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended so be prepared to think on your own, in a logical way, in order to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 7
 - Group Members
     - Manuel Bishop Noriega 4362207
     - Robert E. Matney III 4364229

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [1]:
# import pandas to work with DataFrames
import pandas as pd

# load the group 7 rent data into a DataFrame
df = pd.read_csv("rent_7.csv")

# report the number of records (rows) and columns in the file
print(df.shape)

# keep only the numeric columns used as features, plus the target (price)
# non-numeric columns could also be encoded as numbers and added if relevant
df_num = df[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]

# separate the feature matrix from the target column
X_train = df_num.drop('price', axis=1)
y_train = df_num['price']

# creating an appropriate model with suitable hyper-parameters
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 100, n_jobs = -1)

# fit model to the training data
rf.fit(X_train, y_train)

# R^2 of the model on the same data it was trained on (not a validation score)
r2 = rf.score(X_train, y_train)
print(f"{r2:.4f}")

# refit with oob_score=True to get an out-of-bag R^2 estimate
rf = RandomForestRegressor(n_estimators = 100, n_jobs = -1, oob_score = True)

rf.fit(X_train, y_train)
noisy_oob_r2 = rf.oob_score_
print(f"OOB score {noisy_oob_r2:.4f}")

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import numpy as np

X, y = df_num.drop('price', axis=1), df_num['price']

# repeat a random 80/20 train/validation split 7 times and record each MAE
errors = []
print("Validation MAE trials: ", end='')
for i in range(7):
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_predicted = rf.predict(X_test)
    e = mean_absolute_error(y_test, y_predicted)
    print(f"${e:.0f} ", end='')
    errors.append(e)
print()
noisy_avg_mae = np.mean(errors)
print(f"Average validation MAE ${noisy_avg_mae:.0f}")



(20000, 16)
0.0698
OOB score -0.0070
Validation MAE trials: $1317 $964 $961 $1022 $941 $1038 $973 
Average validation MAE $1031


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

We started by reading the CSV file and checking how many rows (records) and columns it contains.

We then trained a random forest on the numeric columns (bathrooms, bedrooms, longitude, latitude) to look for a relationship between these features and the rent price. The low R2 score on the training data (0.0698) suggested either that no relationship exists or that the data is noisy. We set out to confirm this in the next steps.

Next, we used what the model had learned to predict apartment rent prices, and scored those predictions with the out-of-bag (OOB) estimate.

We then ran seven validation trials, each on a different random 80/20 split of the data, and recorded the mean absolute error (MAE) of each trial.

Averaging the MAE over these trials gave the average validation MAE.

The OOB score (-0.0070) and the average validation MAE ($1031) show that the model predicts very poorly. Before concluding that there is no relationship between the features and the target, however, we should make sure the data set is free of inconsistencies, errors, and outliers.
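
The lab notes ask for the OOB score to be reported as the average of 10 runs. The following is a minimal sketch of one way to do that. The helper name `mean_oob_score` and the synthetic data from `make_regression` are assumptions made for illustration only; in the notebook the helper would be called with `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def mean_oob_score(X, y, n_runs=10, n_estimators=100):
    """Fit a fresh random forest n_runs times and average the OOB R^2.

    Hypothetical helper, not part of scikit-learn.
    """
    scores = []
    for seed in range(n_runs):
        rf = RandomForestRegressor(
            n_estimators=n_estimators,
            oob_score=True,      # compute the out-of-bag R^2 during fit
            n_jobs=-1,
            random_state=seed,   # a different seed per run
        )
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return float(np.mean(scores)), scores


# Synthetic stand-in data (an assumption; not the rent data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=10.0, random_state=0
)
avg_oob, oob_runs = mean_oob_score(X_demo, y_demo, n_runs=10, n_estimators=30)
print(f"Average OOB R^2 over {len(oob_runs)} runs: {avg_oob:.4f}")
```

Averaging over several fits reduces the run-to-run variation that comes from the random bootstrap samples each forest draws.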

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [2]:
# DENOISING: the next two statements remove outliers and errors in apartment prices and locations.
# keep only reasonable prices (this also excludes zero and negative values)
df_clean = df_num[(df_num.price>1_000) & (df_num.price<10_000)]


# restrict the coverage area to New York City only
df_clean = df_clean[(df_clean['latitude']>40.55) &
        (df_clean['latitude']<40.94) &
        (df_clean['longitude']>-74.1) &
        (df_clean['longitude']<-73.67)]

### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [6]:
X, y = df_clean.drop('price', axis=1), df_clean['price']
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, # parallelize
                    oob_score=True) # get out-of-bag R^2 estimate
rf.fit(X, y)
clean_oob_r2 = rf.oob_score_
print(f"Validation OOB score {clean_oob_r2:.4f}")

# repeat a random 80/20 train/validation split 7 times and record each MAE
errors = []
print("Validation MAE trials: ", end='')
for i in range(7):
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.20)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_predicted = rf.predict(X_test)
    e = mean_absolute_error(y_test, y_predicted)
    print(f"${e:.0f} ", end='')
    errors.append(e)
print()
clean_avg_mae = np.mean(errors)
print(f"Average validation MAE ${clean_avg_mae:.0f}")

from sklearn.linear_model import Lasso

# compare against a linear model on a single 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
lm = Lasso(alpha=0.5) # create linear model
lm.fit(X_train, y_train)
print(f"LM Training score {lm.score(X_train, y_train):.4f}")
print(f"LM Validation score {lm.score(X_test, y_test):.4f}")

from sklearn.ensemble import GradientBoostingRegressor

# compare against gradient boosting on the same split
gbr = GradientBoostingRegressor(n_estimators = 2000)
gbr.fit(X_train, y_train)
print(f"GB Training score {gbr.score(X_train, y_train):.4f}")
print(f"GB Validation score {gbr.score(X_test, y_test):.4f}")

Validation OOB score 0.8202
Validation MAE trials: $356 $356 $359 $349 $350 $358 $351 
Average validation MAE $354
LM Training score 0.5709
LM Validation score 0.5687
GB Training score 0.8607
GB Validation score 0.7992


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

We now fit the same random forest on the denoised data set from Part 2 to see how much the scores improve. The results are far better than before: the OOB score rises from -0.0070 to 0.8202 and the average validation MAE falls from $1031 to $354. Next, we compared the random forest against two other models.

First, we fit Lasso, a linear model, to see whether it would fit the data better and give a better score. It did not: its validation score (0.5687) is well below the random forest's OOB score.

Then we tried a gradient boosting model to see whether it would score differently. It did not do better either: its validation score (0.7992) is slightly below the random forest's OOB score.
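
One way to make such a comparison like-for-like is to score every model on the same held-out split. The sketch below does that under stated assumptions: the data is synthetic (from `make_regression`, not the rent data), and the names `models` and `val_scores` are illustrative. Because the synthetic target is linear, the ranking it prints says nothing about how the models rank on the rent data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the rent data).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=4, noise=15.0, random_state=1
)

# One shared 80/20 split so every model sees identical training and validation rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

models = {
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=1),
    "Lasso": Lasso(alpha=0.5),
    "Gradient boosting": GradientBoostingRegressor(n_estimators=200, random_state=1),
}

val_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # score() returns R^2 for scikit-learn regressors
    val_scores[name] = model.score(X_te, y_te)
    print(f"{name}: validation R^2 {val_scores[name]:.4f}")
```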

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
| Problem 1: Some records have a longitude or latitude of 0. | This is impossible for a New York City apartment: coordinates of 0 would place it on the equator or out in the ocean, not in New York. | We kept only records whose latitude and longitude fall inside a bounding box around New York City, which removes the zero entries and any other location outliers. | The coordinates were probably entered incorrectly or mistyped. Even if the other columns of these records hold valid data, the model uses location as a feature, so records with impossible coordinates would mislead it. |
| Problem 2: Prices range from extremely low to extremely high values. | Extremely low and extremely high prices throw off the mean price of apartment rentals. | We filtered the data to keep only records with apartment prices between $1,000 and $10,000. | This removes prices that were probably mistyped or entered incorrectly by accident. |
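
Since each issue should be listed separately, it can help to count how many records each rule removes on its own. The sketch below applies the same price and bounding-box thresholds as Part 2 to a small made-up table; the `toy` values and the mask names `in_price_range` and `in_nyc` are assumptions for illustration, not rows from the rent data.

```python
import pandas as pd

# Made-up records (an assumption): one zero-coordinate row, one negative
# price, and one implausibly high price, plus two ordinary rows.
toy = pd.DataFrame({
    "latitude":  [40.75, 0.0, 40.70, 40.80, 40.72],
    "longitude": [-73.98, 0.0, -73.95, -73.96, -74.00],
    "price":     [3200, 2500, -50, 1_000_000, 4100],
})

# Same thresholds as the Part 2 filters.
in_price_range = (toy["price"] > 1_000) & (toy["price"] < 10_000)
in_nyc = (
    (toy["latitude"] > 40.55) & (toy["latitude"] < 40.94)
    & (toy["longitude"] > -74.1) & (toy["longitude"] < -73.67)
)

# Count the rows each rule rejects on its own.
print(f"Rows failing the price filter: {(~in_price_range).sum()}")
print(f"Rows failing the location filter: {(~in_nyc).sum()}")

toy_clean = toy[in_price_range & in_nyc]
print(f"Rows kept: {len(toy_clean)} of {len(toy)}")
```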
