Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Regression II

In this project, you'll continue working with the New York City rent dataset you used in the last module project.

## Directions

The tasks for this project are as follows:

- **Task 1:** Import `csv` file using `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify `wrangle` function to engineer two new features.
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline mean absolute error for your dataset.
- **Task 6:** Build and train a `LinearRegression` model.
- **Task 7:** Calculate the training and test mean absolute error for your model.
- **Task 8:** Calculate the training and test $R^2$ score for your model.
- **Stretch Goal:** Determine the three most important features for your linear regression model.

**Note**

You should limit yourself to the following libraries for this project:

- `matplotlib`
- `numpy`
- `pandas`
- `sklearn`

# I. Wrangle Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression


def wrangle(filepath):
    df = pd.read_csv(filepath,
                     parse_dates=['created'],
                     index_col='created')

    # Remove the most extreme 1% of prices,
    # the most extreme 0.1% of latitudes, and
    # the most extreme 0.1% of longitudes.
    # .copy() avoids pandas chained-assignment warnings when adding columns below.
    df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
            (df['price'] <= np.percentile(df['price'], 99.5)) & 
            (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
            (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
            (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
            (df['longitude'] <= np.percentile(df['longitude'], 99.95))].copy()

    # Engineered feature 1: total rooms (bedrooms + bathrooms).
    # The column keeps the name 'total_bedrooms' used in the rest of this notebook.
    df['total_bedrooms'] = df['bathrooms'] + df['bedrooms']

    # Engineered feature 2: number of characters in the description
    df['length_of_description'] = df['description'].str.len()

    # Drop rows with NaN values
    df.dropna(inplace=True)

    return df


df = wrangle(DATA_PATH + 'apartments/renthop-nyc.csv')

In [None]:
df.isnull().sum().sum()

0

**Task 1:** Add the following functionality to the above `wrangle` function.

- The `'created'` column will be parsed as a `DateTime` object and set as the `index` of the DataFrame.
- Rows with `NaN` values will be dropped.

Then use your modified function to import the `renthop-nyc.csv` file into a DataFrame named `df`.

In [None]:
df.shape

(47260, 35)

In [None]:
df.head()

created,bathrooms,bedrooms,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_bedrooms,length_of_description
2016-06-24 07:54:24,1.5,3,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5,588.0
2016-06-12 12:19:27,1.0,2,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,8.0
2016-04-17 03:26:41,1.0,1,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,691.0
2016-04-18 02:22:02,1.0,1,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,492.0
2016-04-28 01:32:41,1.0,4,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,479.0


**Task 2:** Using your `pandas` and data visualization skills, decide on two features that you want to engineer for your dataset. Next, modify your `wrangle` function to add those features.

**Note:** You can learn more about feature engineering [here](https://en.wikipedia.org/wiki/Feature_engineering). Here are some ideas for new features:

- Does the apartment have a description?
- Length of description.
- Total number of perks that apartment has.
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths).
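
Several of the ideas above can be sketched on a tiny made-up DataFrame. This is an illustration only: the `sample` data, the `perk_cols` subset, and the new column names (`has_description`, `cats_or_dogs`, `cats_and_dogs`, `perk_count`) are assumptions, not part of the assignment.

```python
import pandas as pd

# Tiny made-up sample; column names mirror the rent dataset.
sample = pd.DataFrame({
    'description': ['Sunny two bedroom', '   ', None],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'elevator': [1, 0, 1],
    'doorman': [0, 0, 1],
})

# Does the apartment have a description? Blank or missing text counts as "no".
desc = sample['description'].fillna('').str.strip()
sample['has_description'] = (desc != '').astype(int)

# Are cats OR dogs allowed? Are cats AND dogs allowed?
cats = sample['cats_allowed'] == 1
dogs = sample['dogs_allowed'] == 1
sample['cats_or_dogs'] = (cats | dogs).astype(int)
sample['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of perks: row-wise sum of binary amenity columns
# (hypothetical subset; the real dataset has many more).
perk_cols = ['elevator', 'doorman']
sample['perk_count'] = sample[perk_cols].sum(axis=1)

print(sample[['has_description', 'cats_or_dogs', 'cats_and_dogs', 'perk_count']])
```

Stripping whitespace before the comparison matters for listings whose description is only spaces, which would otherwise count as having a description.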

In [None]:
df.select_dtypes('object')

created,description,display_address,street_address,interest_level
2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,792 Metropolitan Avenue,medium
2016-06-12 12:19:27,,Columbus Avenue,808 Columbus Avenue,low
2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,241 W 13 Street,high
2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,333 East 49th Street,low
2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,500 West 143rd Street,low
...,...,...,...,...
2016-06-02 05:41:05,"30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...",E 30 St,230 E 30 St,medium
2016-04-04 18:22:34,"HIGH END condo finishes, swimming pool, and ki...",Rector Pl,225 Rector Place,low
2016-04-16 02:13:40,Large Renovated One Bedroom Apartment with Sta...,West 45th Street,341 West 45th Street,low
2016-04-08 02:13:33,Stylishly sleek studio apartment with unsurpas...,Wall Street,37 Wall Street,low


In [None]:
df['description']

created
2016-06-24 07:54:24    A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...
2016-06-12 12:19:27                                                     
2016-04-17 03:26:41    Top Top West Village location, beautiful Pre-w...
2016-04-18 02:22:02    Building Amenities - Garage - Garden - fitness...
2016-04-28 01:32:41    Beautifully renovated 3 bedroom flex 4 bedroom...
                                             ...                        
2016-06-02 05:41:05    30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...
2016-04-04 18:22:34    HIGH END condo finishes, swimming pool, and ki...
2016-04-16 02:13:40    Large Renovated One Bedroom Apartment with Sta...
2016-04-08 02:13:33    Stylishly sleek studio apartment with unsurpas...
2016-04-12 02:48:07    Look no further!!!  This giant 2 bedroom apart...
Name: description, Length: 47260, dtype: object

In [None]:
# Conduct your exploratory data analysis here, 
# and then modify the function above.

# Feature 1: length of each apartment's description (moved into `wrangle`)
#df['length_of_description'] = df['description'].str.len()

# Feature 2: total number of rooms, bedrooms plus bathrooms (moved into `wrangle`)
#df['total_bedrooms'] = df['bathrooms'] + df['bedrooms']

#df[['description', 'length_of_description']]

# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'price'`.

**Note:** In contrast to the last module project, this time you should include _all_ the numerical features in your dataset.

In [None]:
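# Drop the non-numeric (object) columns.
# Note: 'price' is not dropped here, so the target is also a column of X.
X = df.drop(['description', 'display_address', 'street_address', 'interest_level'], axis=1)
y = df['price']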

X

created,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_bedrooms,length_of_description
2016-06-24 07:54:24,1.5,3,40.7145,-73.9425,3000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5,588.0
2016-06-12 12:19:27,1.0,2,40.7947,-73.9667,5465,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,8.0
2016-04-17 03:26:41,1.0,1,40.7388,-74.0018,2850,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,691.0
2016-04-18 02:22:02,1.0,1,40.7539,-73.9677,3275,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,492.0
2016-04-28 01:32:41,1.0,4,40.8241,-73.9493,3350,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.0,479.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016-06-02 05:41:05,1.0,2,40.7426,-73.9790,3200,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,787.0
2016-04-04 18:22:34,1.0,1,40.7102,-74.0163,3950,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,2.0,1125.0
2016-04-16 02:13:40,1.0,1,40.7601,-73.9900,2595,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,671.0
2016-04-08 02:13:33,1.0,0,40.7066,-74.0101,3350,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,735.0


**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from April and May 2016. 
- Your test set should include data from June 2016.
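
One way to express an April and May versus June split on a `DatetimeIndex` is a cutoff at the first instant of June. The sketch below uses made-up listings; the `_demo` names and values are hypothetical.

```python
import pandas as pd

# Made-up listings indexed by creation time.
idx = pd.to_datetime([
    '2016-04-03 10:00:00',
    '2016-05-31 23:59:00',
    '2016-06-01 00:00:00',
    '2016-06-15 08:30:00',
])
X_demo = pd.DataFrame({'bedrooms': [1, 2, 3, 0]}, index=idx)
y_demo = pd.Series([2500, 3200, 4100, 2100], index=idx, name='price')

# Everything strictly before June 1 is training data; June onward is test data.
cutoff = pd.Timestamp('2016-06-01')
mask = X_demo.index < cutoff

X_train_demo, y_train_demo = X_demo.loc[mask], y_demo.loc[mask]
X_test_demo, y_test_demo = X_demo.loc[~mask], y_demo.loc[~mask]

print(X_train_demo.index.max())  # latest training timestamp
print(X_test_demo.index.min())   # earliest test timestamp
```

Checking the latest training timestamp and the earliest test timestamp is a quick way to confirm that no May listing ended up in the test set.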

In [None]:
# Note: with this cutoff, listings created on May 30 and 31 fall in the test set.
cutoff = '2016-05-30 00:00:00'
mask = X.index < cutoff

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]
X_test.tail()

created,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_bedrooms,length_of_description
2016-06-16 04:20:46,1.0,2,40.8456,-73.9361,2295,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,382.0
2016-06-21 06:25:35,1.0,3,40.7817,-73.9497,3995,1,0,1,0,1,1,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,4.0,2289.0
2016-06-02 13:24:18,1.0,1,40.7669,-73.9917,2727,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,281.0
2016-06-06 01:22:44,1.0,2,40.7957,-73.9705,4850,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,519.0
2016-06-02 05:41:05,1.0,2,40.7426,-73.979,3200,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,787.0


# III. Establish Baseline

**Task 5:** Since this is a **regression** problem, you need to calculate the baseline mean absolute error for your model. First, calculate the mean of `y_train`. Next, create a list `y_pred` that has the same length as `y_train` and where every item in the list is the mean. Finally, use `mean_absolute_error` to calculate your baseline.
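
As a small worked example of the baseline idea, the sketch below computes the mean-prediction MAE by hand on three made-up prices and cross-checks it with scikit-learn's `DummyRegressor`, a swapped-in convenience that the task itself does not require. The `_toy` names and values are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Three made-up rents; their mean is 3000.
y_toy = np.array([2000.0, 3000.0, 4000.0])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, ignored by the mean strategy

# By hand: predict the training mean for every row.
mean_pred = np.full(len(y_toy), y_toy.mean())
manual_mae = mean_absolute_error(y_toy, mean_pred)

# Same baseline via DummyRegressor.
dummy = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
dummy_mae = mean_absolute_error(y_toy, dummy.predict(X_toy))

print(manual_mae, dummy_mae)
```

The absolute errors are 1000, 0, and 1000, so both values equal 2000 / 3, roughly 666.67.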

In [None]:
y_pred = [y_train.mean()]*len(y_train)
baseline_mae = mean_absolute_error(y_train, y_pred)
print('Baseline MAE:', baseline_mae)

Baseline MAE: 1203.3593981356803


# IV. Build Model

**Task 6:** Build and train a `LinearRegression` model named `model` using your feature matrix `X_train` and your target vector `y_train`.

In [None]:
# Step 1: Import predictor class
from sklearn.linear_model import LinearRegression

# Step 2: Instantiate predictor
model = LinearRegression()

# Step 3: Fit predictor on the (training) data
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# V. Check Metrics

**Task 7:** Calculate the training and test mean absolute error for your model.

In [None]:
def whats_my_rent(bedrooms):
    # Helper for a model fit on a single feature (bedrooms).
    # The model above is fit on every column of X_train, so it is not called here.
    estimate = model.predict([[bedrooms]])[0]
    coef = model.coef_[0]
    result = f'${estimate:,.0f} for a {bedrooms} bedroom apartment.'
    explanation = f'Cost for each additional bedroom: ${coef:,.0f}'
    return result + ' ' + explanation

training_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print('Training MAE:', training_mae)
print('Test MAE:', test_mae)

Training MAE: 6.074843092533057e-13
Test MAE: 6.084851311987526e-13


**Task 8:** Calculate the training and test $R^2$ score for your model.

In [None]:
training_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print('Training R^2:', training_r2)
print('Test R^2:', test_r2)

Training R^2: 1.0
Test R^2: 1.0
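
An $R^2$ of exactly 1.0 together with an MAE near machine precision usually means the target is present among the features. In this notebook `X_train.columns` includes `'price'`, and the coefficient at that position is 1.0. The sketch below reproduces the effect on synthetic data and shows one possible fix, dropping the target column before fitting. All data and `demo`/`leaky`/`clean` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic data: price is linear in bedrooms plus noise (made-up numbers).
demo = pd.DataFrame({'bedrooms': rng.integers(0, 5, size=n)})
demo['price'] = 2000 + 800 * demo['bedrooms'] + rng.normal(0, 500, size=n)

target = 'price'
y_demo = demo[target]

# Leaky: the target is left among the features, so the model just copies it.
leaky_model = LinearRegression().fit(demo, y_demo)
leaky_r2 = leaky_model.score(demo, y_demo)

# Clean: the target is dropped from the feature matrix.
X_clean = demo.drop(columns=target)
clean_model = LinearRegression().fit(X_clean, y_demo)
clean_r2 = clean_model.score(X_clean, y_demo)

print('Leaky R^2:', leaky_r2)
print('Clean R^2:', clean_r2)
```

In the leaky fit the coefficient on `price` comes out as 1 and every other coefficient collapses toward 0, the same pattern as in the coefficient array printed in section VI.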


# VI. Communicate Results

**Stretch Goal:** What are the three most influential coefficients in your linear model? You should consider the _absolute value_ of each coefficient, so that it doesn't matter if it's positive or negative.

In [None]:
intercept = (model.intercept_)
print(intercept)

2.587512426543981e-10


In [None]:
coef = (model.coef_)
print(coef)

[-1.93586803e-13  2.63122857e-14 -1.45862342e-13  3.39355363e-12
  1.00000000e+00  1.20611463e-13 -1.19646081e-13  2.19536027e-14
 -1.00326706e-13  1.91452559e-13  4.45771073e-14 -4.44444353e-13
 -7.72224946e-14  3.37029151e-14  5.96266293e-14 -1.85487615e-13
  1.18222517e-13 -1.39520360e-13 -1.68178568e-13  5.57980481e-14
 -8.44996002e-14 -7.86422996e-14  8.51027216e-15 -9.76515079e-14
 -1.82987626e-14 -7.92837829e-14 -9.30797441e-14 -8.83288837e-15
  1.32748788e-13 -1.67360049e-13  9.71445147e-17]


In [None]:
f'Price = {intercept} + {coef} *latitude'

'Price = 2.587512426543981e-10 + [-1.93586803e-13  2.63122857e-14 -1.45862342e-13  3.39355363e-12\n  1.00000000e+00  1.20611463e-13 -1.19646081e-13  2.19536027e-14\n -1.00326706e-13  1.91452559e-13  4.45771073e-14 -4.44444353e-13\n -7.72224946e-14  3.37029151e-14  5.96266293e-14 -1.85487615e-13\n  1.18222517e-13 -1.39520360e-13 -1.68178568e-13  5.57980481e-14\n -8.44996002e-14 -7.86422996e-14  8.51027216e-15 -9.76515079e-14\n -1.82987626e-14 -7.92837829e-14 -9.30797441e-14 -8.83288837e-15\n  1.32748788e-13 -1.67360049e-13  9.71445147e-17] *latitude'

In [None]:
X_train.columns

Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'price', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space', 'total_bedrooms',
       'length_of_description'],
      dtype='object')
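
For the stretch goal, one possible approach is to pair each coefficient with its feature name in a `pandas` Series and sort by absolute value. The coefficients and names below are made up for illustration; with a fitted model you would presumably build the Series from `model.coef_` and `X_train.columns` instead.

```python
import numpy as np
import pandas as pd

# Made-up coefficients and feature names, for illustration only.
feature_names = ['bathrooms', 'bedrooms', 'latitude', 'longitude']
coefficients = np.array([1800.0, 450.0, 1600.0, -13000.0])

# Label each coefficient with its feature name.
coef_series = pd.Series(coefficients, index=feature_names)

# Rank by magnitude, ignoring sign, and keep the three largest.
top3 = coef_series.abs().sort_values(ascending=False).head(3)
print(top3)
```

Raw coefficients depend on each feature's units, so this ranking is only a rough guide unless the features are on comparable scales.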