Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

In [145]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Regression II

In this project, you'll continue working with the New York City rent dataset you used in the last module project.

## Directions

The tasks for this project are as follows:

- **Task 1:** Import `csv` file using `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify `wrangle` function to engineer two new features.
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline mean absolute error for your dataset.
- **Task 6:** Build and train a `Linearregression` model.
- **Task 7:** Calculate the training and test mean absolute error for your model.
- **Task 8:** Calculate the training and test $R^2$ score for your model.
- **Stretch Goal:** Determine the three most important features for your linear regression model.

**Note**

You should limit yourself to the following libraries for this project:

- `matplotlib`
- `numpy`
- `pandas`
- `sklearn`

# I. Wrangle Data

In [146]:
import pandas as pd
import numpy as np

def wrangle(filepath):
    df = pd.read_csv(filepath, parse_dates=['created'], index_col='created')
    
    # Remove the most extreme 1% prices,
    # the most extreme .1% latitudes, &
    # the most extreme .1% longitudes
    df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
            (df['price'] <= np.percentile(df['price'], 99.5)) & 
            (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
            (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
            (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
            (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
    
    df.dropna()
    
    return df

filepath = DATA_PATH + 'apartments/renthop-nyc.csv'

**Task 1:** Add the following functionality to the above `wrangle` function.

- The `'created'` column will parsed as a `DateTime` object and set as the `index` of the DataFrame. 
- Rows with `NaN` values will be dropped.

Then use your modified function to import the `renthop-nyc.csv` file into a DataFrame named `df`.

In [147]:
df = wrangle(filepath)

**Task 2:** Using your `pandas` and dataviz skills decide on two features that you want to engineer for your dataset. Next, modify your `wrangle` function to add those features. 

**Note:** You can learn more about feature engineering [here](https://en.wikipedia.org/wiki/Feature_engineering). Here are some ideas for new features:

- Does the apartment have a description?
- Length of description.
- Total number of perks that apartment has.
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths).

In [148]:
# Conduct your exploratory data analysis here, 
# and then modify the function above.
df['perks'] = df['elevator'] + df['cats_allowed'] + df['hardwood_floors'] + df['dogs_allowed'] + df['doorman'] + df['dishwasher'] + df['no_fee'] + df['fitness_center'] + df['laundry_in_unit'] + df['roof_deck'] + df['outdoor_space'] + df['dining_room'] + df['high_speed_internet'] + df['balcony'] + df['swimming_pool'] + df['terrace'] + df['loft'] + df['garden_patio'] + df['wheelchair_access'] + df['common_outdoor_space']
df['rooms'] = df['bathrooms'] + df['bedrooms']

# remove_cols = []

# for cols in list(df.select_dtypes('object').columns):
#     if df[cols].nunique() > 10:
#       remove_cols.append(cols)

# df = df.drop(remove_cols, axis=1)

df = df[['price', 'perks', 'rooms']]
df.head()

Unnamed: 0_level_0,price,perks,rooms
created,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-06-24 07:54:24,3000,0,4.5
2016-06-12 12:19:27,5465,5,3.0
2016-04-17 03:26:41,2850,2,2.0
2016-04-18 02:22:02,3275,2,2.0
2016-04-28 01:32:41,3350,0,5.0


# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'price'`.

**Note:** In contrast to the last module project, this time you should include _all_ the numerical features in your dataset.

In [149]:
# df = df[['price', 'perks']]
# df.head()

In [150]:
target = 'price'
y = df[target]
X = df.drop(columns=target)

**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from April and May 2016. 
- Your test set should include data from June 2016.

In [151]:
import datetime as dt

# df[df.index > '2001-1-1']

In [152]:
trn_mask = (X.index > '2016-3-31') & (X.index < '2016-6-1')
tst_mask = (X.index > '2016-5-31') & (X.index < '2016-7-1')
X_train, y_train = X.loc[trn_mask], y.loc[trn_mask]
X_test, y_test = X.loc[tst_mask], y.loc[tst_mask]
# print(X_train)

# III. Establish Baseline

**Task 5:** Since this is a **regression** problem, you need to calculate the baseline mean absolute error for your model. First, calculate the mean of `y_train`. Next, create a list `y_pred` that has the same length as `y_train` and where every item in the list is the mean. Finally, use `mean_absolute_error` to calculate your baseline.

In [153]:
from sklearn.metrics import mean_absolute_error

y_pred = [y_train.mean()] * len(y_train)
baseline_mae = mean_absolute_error(y_train, y_pred)
print('Baseline MAE:', baseline_mae)

Baseline MAE: 1201.8811133682555


# IV. Build Model

**Task 6:** Build and train a `LinearRegression` model named `model` using your feature matrix `X_train` and your target vector `y_train`.

In [154]:
# Step 1: Import predictor class
from sklearn.linear_model import LinearRegression

# Step 2: Instantiate predictor
model = LinearRegression()

# Step 3: Fit predictor on the (training) data
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# V. Check Metrics

**Task 7:** Calculate the training and test mean absolute error for your model.

In [155]:
y_pred_train_lr = model.predict(X_train)
y_pred_test_lr = model.predict(X_test)

training_mae = mean_absolute_error(y_train, y_pred_train_lr)
test_mae = mean_absolute_error(y_test, y_pred_test_lr)

print('Training MAE:', training_mae)
print('Test MAE:', test_mae)

Training MAE: 853.5535239009195
Test MAE: 864.4560011750385


**Task 8:** Calculate the training and test $R^2$ score for your model.

In [156]:
from sklearn.metrics import mean_squared_error

training_r2 = mean_squared_error(y_train, y_pred_train_lr, squared=False)
test_r2 = mean_squared_error(y_test, y_pred_test_lr, squared=False)

print('Training MAE:', training_r2)
print('Test MAE:', test_r2)

Training MAE: 1292.2320937080833
Test MAE: 1289.06589530519


# VI. Communicate Results

**Stretch Goal:** What are the three most influential coefficients in your linear model? You should consider the _absolute value_ of each coefficient, so that it doesn't matter if it's positive or negative.

In [157]:
print(model.coef_)

[109.1974652  764.77306413]


In [158]:
print(model.intercept_)

1008.6821880877105
