
# Doing Linear Regression

### Objectives
- arrange data into X features matrix and y target vector
- use scikit-learn for linear regression
- use regression metric: MAE
- do one-hot encoding
- scale features

In [1]:
!pip install category_encoders



In [2]:
!pip install -U pandas-profiling

Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.6/dist-packages (2.1.2)


In [0]:
import numpy as np
import pandas as pd
import pandas_profiling

from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

## Project: Predict NYC apartment rent 🏠💸

You'll use a real-world dataset of rent prices for a subset of apartments in New York City!


### Define the data on which you'll train

- Get the data
- What's the target?
- Regression or classification?

In [0]:
LOCAL = '../data/nyc/nyc-rent-2016.csv'
WEB = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/nyc/nyc-rent-2016.csv'

df = pd.read_csv(WEB)
assert df.shape == (48300, 34)
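
The cell above defines both `LOCAL` and `WEB` but only reads from `WEB`. A minimal sketch of one way to choose between them, assuming you want the local copy whenever it exists on disk. The helper name `pick_source` is hypothetical and not part of the original notebook.

```python
import os

LOCAL = '../data/nyc/nyc-rent-2016.csv'
WEB = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/nyc/nyc-rent-2016.csv'


def pick_source(local_path, web_url):
    """Hypothetical helper: return the local path if the file exists, else the URL."""
    return local_path if os.path.exists(local_path) else web_url


source = pick_source(LOCAL, WEB)
print(source)
```

The result can then be passed to `pd.read_csv(source)`, which accepts either a file path or a URL.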

In [5]:
df.profile_report()



## Do train/test split
 
For this project, we'll split based on time.

- Use data from April & May 2016 to train.
- Use data from June 2016 to test.
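
The time-based split described above can be sketched on a tiny made-up DataFrame. The rows and the `toy_*` names are illustrative assumptions, not values from the real dataset.

```python
import pandas as pd

# Four made-up listings spanning April through June 2016.
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-03 10:00:00',
        '2016-05-20 08:30:00',
        '2016-06-01 00:00:00',
        '2016-06-15 12:00:00',
    ]),
    'price': [3000, 2500, 4100, 3300],
})

# Derive the month, then train on months before June and test on June.
toy['month'] = toy['created'].dt.month
toy_train = toy.query('month < 6')
toy_test = toy.query('month == 6')

# Sanity check: every training row is older than every test row.
assert toy_train['created'].max() < toy_test['created'].min()
print(toy_train.shape, toy_test.shape)
```

Splitting on time rather than at random keeps the test set strictly in the "future" relative to the training data, which mirrors how a rent model would be used in practice.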

In [6]:
df['created'] = pd.to_datetime(df['created'], infer_datetime_format=True)
df['created'].describe()

count                   48300
unique                  47643
top       2016-06-12 13:20:45
freq                        3
first     2016-04-01 22:12:41
last      2016-06-29 21:41:47
Name: created, dtype: object

In [0]:
df['month'] = df['created'].dt.month

In [0]:
train = df.query('month < 6')
test = df.query('month == 6')

In [9]:
train.shape, test.shape

((31515, 35), (16785, 35))

## Begin with baselines for regression

In [10]:
train['price'].mean()

3432.7534190068222

In [11]:
y_test = test['price']
# Note: np.full_like copies y_test's integer dtype, so the mean (3432.75...) is truncated to 3432.
y_pred = np.full_like(y_test, fill_value=train['price'].mean())
print(len(y_test), len(y_pred))
print(y_pred)
print(f'MAE: {MAE(y_test, y_pred)}')
print(f'RMSE: {np.sqrt(MSE(y_test, y_pred))}')

16785 16785
[3432 3432 3432 ... 3432 3432 3432]
MAE: 1052.5193327375632
RMSE: 1407.0359503030966
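
The printed predictions are whole numbers (`3432`) even though the training mean is `3432.75...`. This happens because `np.full_like` reuses the dtype of the array it is given. A small sketch with made-up prices shows the effect and one way to keep the fractional part. The `mae` and `rmse` helpers are plain NumPy stand-ins written for this example, not the scikit-learn functions.

```python
import numpy as np

y_train_toy = np.array([3000, 3500, 3800])  # made-up training prices (integers)
y_test_toy = np.array([2000, 3000, 5000])   # made-up test prices (integers)
mean_price = y_train_toy.mean()             # a float with a fractional part

# full_like inherits the integer dtype of y_test_toy, so the mean is truncated.
truncated = np.full_like(y_test_toy, fill_value=mean_price)

# np.full with a float fill value yields a float array and keeps the exact mean.
exact = np.full(len(y_test_toy), mean_price)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


print(truncated.dtype, exact.dtype)
print(mae(y_test_toy, truncated), mae(y_test_toy, exact))
print(rmse(y_test_toy, truncated), rmse(y_test_toy, exact))
```

On rent-sized numbers the difference is small, but using a float baseline avoids a silent rounding step.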


In [12]:
df['price'].std()

1401.4222466501867
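
The standard deviation is a useful reference point for the baseline RMSE: when you predict the mean of a set of values for those same values, the RMSE equals their population standard deviation (`ddof=0`). A minimal sketch on made-up prices follows. In the notebook the two numbers are close but not identical, since the baseline uses the training mean on test data and pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

prices = pd.Series([2000.0, 3000.0, 4000.0, 5000.0])  # made-up prices

# Predict the mean for every row and compute RMSE against the same values.
pred = np.full(len(prices), prices.mean())
rmse = np.sqrt(np.mean((prices - pred) ** 2))

population_std = prices.std(ddof=0)  # divides by n
sample_std = prices.std()            # pandas default, divides by n - 1

print(rmse, population_std, sample_std)
```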

## Encode categorical features

### Which features are non-numeric?

In [13]:
train.describe(exclude='number')

| | created | description | display_address | street_address | interest_level |
|---|---|---|---|---|---|
| count | 31515 | 30549.0 | 31447 | 31509 | 31515 |
| unique | 31116 | 25482.0 | 6492 | 11247 | 3 |
| top | 2016-05-02 03:41:36 | | Broadway | 505 West 37th Street | low |
| freq | 3 | 897.0 | 268 | 120 | 21613 |
| first | 2016-04-01 22:12:41 | | | | |
| last | 2016-05-31 23:10:48 | | | | |


In [14]:
train['interest_level'].value_counts()

low       21613
medium     7400
high       2502
Name: interest_level, dtype: int64

In [0]:
# List the categories in order so that low < medium < high maps to 0 < 1 < 2.
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded = encoder.fit_transform(np.array(train['interest_level']).reshape(-1,1))

In [16]:
pd.DataFrame(encoded)[0].value_counts()

0.0    21613
1.0     7400
2.0     2502
Name: 0, dtype: int64
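
The encoder above is fit on the training rows only. A minimal sketch, on made-up data, of how the same fitted encoder could then be applied to held-out rows with `transform`. The `toy_*` frames are assumptions for illustration, and selecting the column with double brackets is simply a way to get the 2D input the encoder expects.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up stand-ins for the train and test splits.
toy_train = pd.DataFrame({'interest_level': ['low', 'high', 'medium', 'low']})
toy_test = pd.DataFrame({'interest_level': ['medium', 'low']})

# Explicit category order: low -> 0, medium -> 1, high -> 2.
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])

# Fit on the training data, then reuse the same mapping on the test data.
train_encoded = encoder.fit_transform(toy_train[['interest_level']])
test_encoded = encoder.transform(toy_test[['interest_level']])

print(train_encoded.ravel())
print(test_encoded.ravel())
```

Without the explicit `categories` argument the encoder would order the labels alphabetically (`high`, `low`, `medium`), which would not reflect the intended ranking.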