Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

In [66]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/pixeltests/datasets/main/'

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Regression II

In this project, you'll continue working with the New York City rent dataset you used in the last module project.

## Directions

The tasks for this project are as follows:

- **Task 1:** Import the `csv` file using the `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify the `wrangle` function to engineer two new features.
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline mean absolute error for your dataset.
- **Task 6:** Build and train a `LinearRegression` model.
- **Task 7:** Calculate the training and test mean absolute error for your model.
- **Task 8:** Calculate the training and test $R^2$ score for your model.
- **Stretch Goal:** Determine the three most important features for your linear regression model.

**Note**

You should limit yourself to the following libraries for this project:

- `matplotlib`
- `numpy`
- `pandas`
- `sklearn`

# I. Wrangle Data

In [79]:
import numpy as np
import pandas as pd

def wrangle(filepath):
    df = pd.read_csv(filepath)

    # Remove the most extreme 1% prices,
    # the most extreme .1% latitudes, &
    # the most extreme .1% longitudes
    df = df[(df['price'] >= np.percentile(df['price'], 0.5)) &
            (df['price'] <= np.percentile(df['price'], 99.5)) &
            (df['latitude'] >= np.percentile(df['latitude'], 0.05)) &
            (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
            (df['longitude'] >= np.percentile(df['longitude'], 0.05)) &
            (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
    # df['created'] = pd.to_datetime(df['created'])
    # df.set_index('created', inplace=True)
    # Engineered features: number of pet types allowed (0, 1, or 2)
    # and total room count (bathrooms + bedrooms)
    df['pets_allowed'] = df['cats_allowed'] + df['dogs_allowed']
    df['total_rooms'] = df['bathrooms'] + df['bedrooms']

    # Drop the non-numeric columns that are not used as features
    df = df.drop(columns=['description', 'display_address',
                          'interest_level', 'street_address'])

    # dropna returns a new DataFrame, so assign the result back
    df = df.dropna()
    return df

filepath = DATA_PATH + 'renthop-nyc.csv'

**Task 1:** Add the following functionality to the above `wrangle` function.

- The `'created'` column will be parsed as a `DateTime` object and set as the `index` of the DataFrame.
- Rows with `NaN` values will be dropped.

Then use your modified function to import the `renthop-nyc.csv` file into a DataFrame named `df`.
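
The two additions can be sketched as follows, run against a tiny in-memory CSV rather than the real file. The toy rows and the name `wrangle_sketch` are illustrative assumptions, not part of the assignment; `parse_dates` and `index_col` are standard `pd.read_csv` parameters.

```python
import io

import pandas as pd

# Toy stand-in for renthop-nyc.csv (hypothetical rows, for illustration only).
TOY_CSV = """created,price,bedrooms
2016-04-10 08:00:00,3000,2
2016-05-02 09:30:00,,1
2016-06-15 12:00:00,4200,3
"""


def wrangle_sketch(filepath_or_buffer):
    # Parse 'created' as datetimes and use it as the index in one step.
    df = pd.read_csv(filepath_or_buffer,
                     parse_dates=['created'],
                     index_col='created')
    # dropna returns a new DataFrame, so the result must be assigned back.
    df = df.dropna()
    return df


toy_df = wrangle_sketch(io.StringIO(TOY_CSV))
print(type(toy_df.index).__name__)
print(len(toy_df))
```

Once `created` is the index, date filters apply to `df.index` rather than to a `created` column.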

In [80]:
import pandas as pd
df = wrangle(filepath)
df.head(2)
# df.isnull().sum().sum()
# df.dropna()


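| | bathrooms | bedrooms | created | latitude | longitude | price | elevator | cats_allowed | hardwood_floors | dogs_allowed | ... | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space | pets_allowed | total_rooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 3 | 2016-06-24 07:54:24 | 40.7145 | -73.9425 | 3000 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.5 |
| 1 | 1.0 | 2 | 2016-06-12 12:19:27 | 40.7947 | -73.9667 | 5465 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 3.0 |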


**Task 2:** Using your `pandas` and dataviz skills, decide on two features that you want to engineer for your dataset. Next, modify your `wrangle` function to add those features.

**Note:** You can learn more about feature engineering [here](https://en.wikipedia.org/wiki/Feature_engineering). Here are some ideas for new features:

- Does the apartment have a description? Yes
- Length of description.
- Total number of perks that apartment has.
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths).
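
One way the ideas above might be turned into columns is sketched below on a made-up DataFrame. The column names mirror the dataset, but the rows, the new feature names, and the two-item `perk_cols` list are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy listings; values are invented for this sketch.
toy = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bathrooms': [1.0, 1.5, 1.0],
    'bedrooms': [2, 3, 0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'elevator': [1, 0, 0],
    'doorman': [1, 0, 1],
})

# Treat missing or whitespace-only descriptions as "no description".
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Perk count: row-wise sum of binary amenity columns (only two listed here).
perk_cols = ['elevator', 'doorman']
toy['perk_count'] = toy[perk_cols].sum(axis=1)

# Cats OR dogs versus cats AND dogs.
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total rooms (beds + baths).
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy[['has_description', 'description_length', 'perk_count',
           'cats_or_dogs', 'cats_and_dogs', 'total_rooms']])
```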

In [45]:
# Conduct your exploratory data analysis here,
# and then modify the function above.
# print('Descriptions : ',df['description'].head(4))

# print('Length of Description minimum : ',df['description'].str.len().min())
# print('Length of Description maximum : ',df['description'].str.len().max())

# print('Total number of perks that apartment has Fitness Center, swimming pool etc')

# print('Are cats allowed : ',df['cats_allowed'].describe())

# print('Are dogs allowed : ',df['dogs_allowed'].describe())

# df['pets_allowed'] = df['cats_allowed']+df['dogs_allowed']
# print('Are cats and dogs allowed : ',df['pets_allowed'].head(2))

# df['total_rooms'] = df['bathrooms']+df['bedrooms']
# print('Total Rooms :  ',df['total_rooms'].head(2))

In [81]:
df.head(3)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48817 entries, 0 to 49351
Data columns (total 32 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   bathrooms             48817 non-null  float64
 1   bedrooms              48817 non-null  int64  
 2   created               48817 non-null  object 
 3   latitude              48817 non-null  float64
 4   longitude             48817 non-null  float64
 5   price                 48817 non-null  int64  
 6   elevator              48817 non-null  int64  
 7   cats_allowed          48817 non-null  int64  
 8   hardwood_floors       48817 non-null  int64  
 9   dogs_allowed          48817 non-null  int64  
 10  doorman               48817 non-null  int64  
 11  dishwasher            48817 non-null  int64  
 12  no_fee                48817 non-null  int64  
 13  laundry_in_building   48817 non-null  int64  
 14  fitness_center        48817 non-null  int64  
 15  pre-war            

# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'price'`.

**Note:** In contrast to the last module project, this time you should include _all_ the numerical features in your dataset.
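
A minimal sketch of this split, assuming a small made-up DataFrame: `select_dtypes(include='number')` keeps only the numeric columns, and the target is then dropped so it cannot leak into `X`.

```python
import pandas as pd

# Hypothetical toy data; 'created' is a string column, as in the raw CSV.
toy = pd.DataFrame({
    'created': ['2016-04-10 08:00:00', '2016-06-15 12:00:00'],
    'price': [3000, 4200],
    'bedrooms': [2, 3],
    'bathrooms': [1.0, 1.5],
})

target = 'price'
y = toy[target]

# Keep numeric columns only, then remove the target from the features.
X = toy.select_dtypes(include='number').drop(columns=target)

print(list(X.columns))
```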

In [87]:

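# Split by listing date. 'created' is still a string column here, so these are
# lexicographic comparisons on 'YYYY-MM-DD HH:MM:SS' strings.
# Caveat: the upper bounds '2016-05-31' and '2016-06-30' exclude listings
# created on those two days; '< 2016-06-01' and '< 2016-07-01' would include them.
train = df.loc[(df['created'] < '2016-05-31') & (df['created'] > '2016-04-01')]
test = df.loc[(df['created'] < '2016-06-30') & (df['created'] > '2016-06-01')]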


**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from April and May 2016.
- Your test set should include data from June 2016.
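
A date-based split along these lines could be sketched as below. The toy rows are assumptions picked to sit on the month boundaries, and the half-open intervals `[start, end)` are one possible design choice that keeps every timestamp on the last day of each month.

```python
import pandas as pd

# Hypothetical listings placed at the edges of the April-May and June windows.
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-01 00:05:00', '2016-05-31 23:50:00',
        '2016-06-01 00:10:00', '2016-06-30 18:00:00',
    ]),
    'bedrooms': [1, 2, 3, 1],
    'price': [2500, 3100, 4000, 2700],
})

X = toy[['bedrooms']]
y = toy['price']

# Half-open intervals: start is included, end is excluded.
train_mask = (toy['created'] >= '2016-04-01') & (toy['created'] < '2016-06-01')
test_mask = (toy['created'] >= '2016-06-01') & (toy['created'] < '2016-07-01')

X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_mask], y[test_mask]

print(len(X_train), len(X_test))
```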

In [88]:
X_train = train.drop('price',axis = 1)
y_train = train['price']

X_test = test.drop('price',axis = 1)
y_test = test['price']

# III. Establish Baseline

**Task 5:** Since this is a **regression** problem, you need to calculate the baseline mean absolute error for your model. First, calculate the mean of `y_train`. Next, create a list `y_pred` that has the same length as `y_train` and where every item in the list is the mean. Finally, use `mean_absolute_error` to calculate your baseline.
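
The three steps described above can be sketched on a small made-up target vector (the values in `y_train_toy` are assumptions for illustration):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical training prices.
y_train_toy = pd.Series([2000, 3000, 4000, 7000])

# Step 1: the baseline guess is the mean of the training target.
baseline = y_train_toy.mean()

# Step 2: predict that same value for every observation.
y_pred = [baseline] * len(y_train_toy)

# Step 3: score the constant prediction with MAE.
baseline_mae = mean_absolute_error(y_train_toy, y_pred)
print('Baseline MAE:', baseline_mae)
```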

In [89]:
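# The baseline guess is the mean rent in the training set
guess = y_train.mean()
print(guess)

# Baseline MAE: average absolute gap between that constant guess and each actual price
errors = guess - y_train
baseline_mae = errors.abs().mean()
print('Baseline MAE:', baseline_mae)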


3575.737743019567
Baseline MAE: 1201.9731808995773


# IV. Build Model

**Task 6:** Build and train a `LinearRegression` model named `model` using your feature matrix `X_train` and your target vector `y_train`.
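
As a quick sanity check of the import, instantiate, fit pattern, the sketch below fits a model to made-up data that follows an exact line, so the learned slope and intercept should match the ones used to generate it. The names `X_toy`, `y_toy`, and `model_toy` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 1000 * x + 500 with no noise.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 1000 * X_toy[:, 0] + 500

# Instantiate and fit in one chained call.
model_toy = LinearRegression().fit(X_toy, y_toy)

print(model_toy.coef_, model_toy.intercept_)
```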

In [90]:
# Step 1: Import predictor class
from sklearn.linear_model import LinearRegression

# Step 2: Instantiate predictor
model = LinearRegression()

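# 'created' holds date strings, which LinearRegression cannot use as a feature,
# so drop it from both feature matrices before fitting
X_train.drop('created', axis=1, inplace=True)
X_test.drop('created', axis=1, inplace=True)
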
# Step 3: Fit predictor on the (training) data
model.fit(X_train, y_train)

# V. Check Metrics

**Task 7:** Calculate the training and test mean absolute error for your model.

In [91]:
from sklearn.metrics import mean_absolute_error

# Predict on the training features and compare against the training target
y_train_pred = model.predict(X_train)
training_mae = mean_absolute_error(y_train, y_train_pred)

# Predict on the held-out test features and compare against the test target
y_test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)

print('Training MAE:', training_mae)
print('Test MAE:', test_mae)

Training MAE: 692.9500099323958
Test MAE: 701.458812114077


**Task 8:** Calculate the training and test $R^2$ score for your model.
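
For a regressor, `model.score` and `r2_score` report the same quantity, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below checks that on synthetic data; the data, the seed, and the `_toy` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noisy linear data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=50)

model_toy = LinearRegression().fit(X_toy, y_toy)
y_hat = model_toy.predict(X_toy)

# Three routes to the same number.
r2_from_score = model_toy.score(X_toy, y_toy)
r2_from_metric = r2_score(y_toy, y_hat)
ss_res = np.sum((y_toy - y_hat) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_from_score, r2_from_metric, r2_manual)
```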

In [92]:
# model.score returns the coefficient of determination (R^2) for regressors
training_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print('Training R^2:', training_r2)
print('Test R^2:', test_r2)

Training R^2: 0.6172885090621005
Test R^2: 0.6257149339932598


# VI. Communicate Results

**Stretch Goal:** What are the three most influential coefficients in your linear model? You should consider the _absolute value_ of each coefficient, so that it doesn't matter if it's positive or negative.
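
Sorting a bare array of coefficients loses track of which feature each value belongs to. One possible approach, sketched here on made-up data, is to wrap the coefficients in a `pd.Series` indexed by column name before ranking by absolute value. The toy features and the `_toy` names are assumptions, and coefficient size depends on each feature's scale, so this ranking is only a rough guide.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features and a target built from known coefficients.
X_toy = pd.DataFrame({
    'bedrooms': [1, 2, 3, 4, 2, 3],
    'doorman': [0, 1, 0, 1, 0, 1],
    'elevator': [0, 0, 1, 1, 1, 0],
})
y_toy = (500 * X_toy['bedrooms'] + 300 * X_toy['doorman']
         - 100 * X_toy['elevator'] + 1000)

model_toy = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with its feature name, then rank by magnitude.
coefs = pd.Series(model_toy.coef_, index=X_toy.columns)
top = coefs.abs().sort_values(ascending=False).head(2)
print(top)
```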

In [103]:
coefficient = model.coef_

# Sort the absolute coefficient values in ascending order;
# the three largest are the last three entries
np.sort(abs(coefficient))


array([1.98730469e+01, 4.59418945e+01, 4.92333984e+01, 5.22176856e+01,
       5.55859375e+01, 5.97903182e+01, 6.87059716e+01, 8.00413909e+01,
       1.01316833e+02, 1.11424489e+02, 1.15103394e+02, 1.22753288e+02,
       1.36573730e+02, 1.40625594e+02, 1.48882019e+02, 1.67509429e+02,
       1.69690127e+02, 1.80670898e+02, 1.88901855e+02, 2.06089498e+02,
       2.42246094e+02, 2.60858235e+02, 3.42072266e+02, 4.82193707e+02,
       4.97404890e+02, 1.64031919e+03, 1.36822721e+04, 6.89044337e+14,
       6.89044337e+14, 6.89044337e+14])

In [None]:
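# The three largest absolute coefficients are all 6.89044337e+14.
# Identical magnitudes this large usually signal perfectly collinear features
# (for example, pets_allowed is the sum of cats_allowed and dogs_allowed),
# so they should not be read as genuine feature importance.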