<!--
Author: Brian Thomas Ross <admin@brianthomasross.com>
License: BSD-3-Clause
-->

## Concepts
----

### What is the purpose of performing a train-test split on your data when creating predictive models?

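To hold back data that the model never sees during training. The model is fit on the majority of the data, and the untouched portion is then used to estimate how accurately it predicts new, unseen observations.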
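As a minimal sketch of this idea, the snippet below fits a straight line on 80% of some made-up data and measures the error on both portions. The data, the 80/20 ratio, and the helper name `mean_absolute_error` are assumptions chosen for illustration; they are not taken from the course dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy linear relationship (not the course dataset)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=200)

# Hold out the last 20% of rows; the model never sees them while fitting
split = int(len(x) * 0.8)
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Fit a straight line on the training rows only
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def mean_absolute_error(x_part, y_part):
    """Average absolute gap between the line's predictions and the targets."""
    return float(np.mean(np.abs(y_part - (slope * x_part + intercept))))


train_mae = mean_absolute_error(x_train, y_train)
test_mae = mean_absolute_error(x_test, y_test)
print(f"train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}")
```

The test error is the number to report, since it comes from rows the model had no chance to memorize.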

### In data where dates or times might be considered determining factors, what should one take care to do when performing the train-test split?

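Split by time rather than at random: train on earlier observations and test on later ones, so the model is evaluated on data that comes after everything it was trained on. Assuming the date/time is in a pandas datetime format, it can be compared against a cutoff to make the split, or it can be changed into its own columns (for example year, month, and day) for use as features.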
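A small sketch of a time-based split is shown here on a made-up table. The frame `listings`, its values, and the names `train_part` and `test_part` are hypothetical; only the cutoff-comparison pattern is the point.

```python
import pandas as pd

# Hypothetical listings; the values are illustrative only
listings = pd.DataFrame({
    "created": pd.to_datetime([
        "2016-04-03", "2016-06-10", "2016-05-21", "2016-04-28", "2016-06-02",
    ]),
    "price": [2500, 3100, 2800, 2650, 3300],
})

# Rows before the cutoff train the model; rows on or after it test the model
cutoff = pd.Timestamp("2016-06-01")
train_part = listings[listings["created"] < cutoff]
test_part = listings[listings["created"] >= cutoff]

# Every training row should predate every test row
assert train_part["created"].max() < test_part["created"].min()
print(len(train_part), len(test_part))
```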

### What is data leakage?

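Data leakage occurs when information that would not be available at prediction time, such as the target itself or rows from the test set, finds its way into the training data. The model then appears to perform better than it really does, leading to misleading results.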
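One common form of leakage is computing preprocessing statistics on the full dataset before splitting. The sketch below contrasts that with computing them on the training rows only; the arrays are made up, and standardization is just one example of a step that can leak.

```python
import numpy as np

# Hypothetical feature values, already split into train and test
train_values = np.array([1.0, 2.0, 3.0, 4.0])
test_values = np.array([10.0, 12.0])

# Leaky: the mean and standard deviation include the test rows
all_values = np.concatenate([train_values, test_values])
leaky_train = (train_values - all_values.mean()) / all_values.std()

# Safer: statistics come from the training rows and are reused on the test rows
mu, sigma = train_values.mean(), train_values.std()
clean_train = (train_values - mu) / sigma
clean_test = (test_values - mu) / sigma

print(clean_train.mean())  # centered on the training data
print(leaky_train.mean())  # shifted, because test rows influenced the statistics
```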

### What are some potential business use-cases for linear regression models?

Forecasting future sales numbers, or identifying trends in environmental data.

## Watch
----

Watch the following short video, which gives a brief overview of $R^2$:
<div>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Q-TtIPF0fCU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

In [1]:
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Q-TtIPF0fCU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
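Assuming the video refers to the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$, the definition can be sketched directly in NumPy. The function name `r_squared` and the sample values are illustrative assumptions, not something taken from the video.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y_true, y_true))     # perfect predictions give 1.0
print(r_squared(y_true, [6.0] * 4))  # always predicting the mean gives 0.0
```

A model that does worse than always predicting the mean produces a negative value under this definition.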



## Explore
----

Take some time to explore the dataset that we will be using during lecture.

In [5]:
%%capture
import sys

# Load the data directly from the course's GitHub repository:
DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

In [28]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

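# Parse the listing timestamps so the data can be split by date:
# data from April & May 2016 is used to train,
# data from June 2016 is used to test.
# (infer_datetime_format is deprecated in pandas 2.x; inference is now the default)
df['created'] = pd.to_datetime(df['created'])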
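The percentile filtering above repeats the same pattern for each column. A hedged sketch of how that pattern could be wrapped in a helper follows; `trim_extremes` is a hypothetical function, the `prices` frame is made up, and both bounds are inclusive here.

```python
import numpy as np
import pandas as pd


def trim_extremes(frame, column, lower_pct, upper_pct):
    """Keep rows whose `column` value lies between two percentiles (inclusive)."""
    low, high = np.percentile(frame[column], [lower_pct, upper_pct])
    return frame[frame[column].between(low, high)]


# Hypothetical prices 1..1000 (not the rental data)
prices = pd.DataFrame({"price": np.arange(1, 1001)})

# Drop the most extreme 1% of prices: 0.5% from each tail
trimmed = trim_extremes(prices, "price", 0.5, 99.5)
print(len(prices), len(trimmed))
```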

In [31]:
df = df.drop(["description", "display_address", "street_address",
              "latitude"], axis=1)

In [32]:
cutoff = pd.to_datetime('2016-06-01')
train = df[df["created"] < cutoff]
test  = df[df["created"] >= cutoff]

In [29]:
# Assert statements can be helpful to avoid simple errors
assert len(train) + len(test) == len(df)

In [33]:
train = train.drop(["created"], axis=1)

In [34]:
train.head()

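|  | bathrooms | bedrooms | longitude | price | interest_level | elevator | cats_allowed | hardwood_floors | dogs_allowed | doorman | ... | high_speed_internet | balcony | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 1 | -74.0018 | 2850 | high | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1 | -73.9677 | 3275 | low | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 4 | -73.9493 | 3350 | low | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2.0 | 4 | -74.0028 | 7995 | medium | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1.0 | 2 | -73.966 | 3600 | low | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |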


In [35]:
train.describe()

|  | bathrooms | bedrooms | longitude | price | elevator | cats_allowed | hardwood_floors | dogs_allowed | doorman | dishwasher | ... | high_speed_internet | balcony | swimming_pool | new_construction | terrace | exclusive | loft | garden_patio | wheelchair_access | common_outdoor_space |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | ... | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 | 31844.0 |
| mean | 1.203728 | 1.528357 | -73.972867 | 3575.604007 | 0.53043 | 0.477139 | 0.480907 | 0.445861 | 0.430725 | 0.418666 | ... | 0.08862 | 0.060734 | 0.055929 | 0.05147 | 0.047733 | 0.042269 | 0.044216 | 0.039222 | 0.028388 | 0.029048 |
| std | 0.472447 | 1.105061 | 0.02891 | 1762.136694 | 0.499081 | 0.499485 | 0.499643 | 0.497068 | 0.495185 | 0.493348 | ... | 0.284198 | 0.238845 | 0.229788 | 0.220957 | 0.213203 | 0.201204 | 0.205577 | 0.194127 | 0.166082 | 0.167943 |
| min | 0.0 | 0.0 | -74.0873 | 1375.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 1.0 | -73.9918 | 2500.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 1.0 | -73.9781 | 3150.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 2.0 | -73.955 | 4095.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 10.0 | 7.0 | -73.7001 | 15500.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [36]:
train.describe(exclude='number')

|  | interest_level |
|---|---|
| count | 31844 |
| unique | 3 |
| top | low |
| freq | 22053 |


In [37]:
for col in train:
    print(col)

bathrooms
bedrooms
longitude
price
interest_level
elevator
cats_allowed
hardwood_floors
dogs_allowed
doorman
dishwasher
no_fee
laundry_in_building
fitness_center
pre-war
laundry_in_unit
roof_deck
outdoor_space
dining_room
high_speed_internet
balcony
swimming_pool
new_construction
terrace
exclusive
loft
garden_patio
wheelchair_access
common_outdoor_space


In [38]:
train = train.replace({"low":1, "medium":2, "high":3})

In [39]:
test = test.replace({"low":1, "medium":2, "high":3})
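The two `replace` calls above turn the ordered labels into integers across the whole frame. As an alternative sketch, the same encoding can be limited to a single column with `Series.map`; the frame `sample` and the dictionary name `interest_order` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the same three labels as interest_level
sample = pd.DataFrame({"interest_level": ["low", "high", "medium", "low"]})

# Ordered labels -> integers, applied to one column only
interest_order = {"low": 1, "medium": 2, "high": 3}
sample["interest_level"] = sample["interest_level"].map(interest_order)

print(sample["interest_level"].tolist())  # [1, 3, 2, 1]
```

Unlike `replace`, `map` turns any label missing from the dictionary into `NaN`, which makes unexpected categories easy to spot.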

In [18]:
train.describe(exclude="number")

Unnamed: 0,created,description,display_address,street_address
count,31844,30875.0,31775,31838
unique,31436,25735.0,6468,11280
top,2016-05-14 01:11:03,,Broadway,505 West 37th Street
freq,3,906.0,273,120
first,2016-04-01 22:12:41,,,
last,2016-05-31 23:10:48,,,


In [21]:
import plotly.express as px

px.scatter(train, x="longitude", y="price", trendline="ols")
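With `trendline="ols"`, Plotly Express overlays an ordinary least squares fit on the scatter plot (it relies on the statsmodels package to compute it). The sketch below shows the closed-form calculation for a single feature on made-up points; it illustrates the technique and is not Plotly's own code, and the names `xs`, `ys`, `ols_slope`, and `ols_intercept` are assumptions.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs + 1.0

# Closed-form simple linear regression:
# slope = sum of co-deviations / sum of squared x deviations
x_dev = xs - xs.mean()
y_dev = ys - ys.mean()
ols_slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
ols_intercept = ys.mean() - ols_slope * xs.mean()

print(ols_slope, ols_intercept)
```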

## Road to Local Development
----

Take the remainder of the time until lecture to begin backing up the important things on your computer. Some popular choices for backing up to the cloud include:


- [OneDrive](https://www.microsoft.com/en-us/microsoft-365/onedrive/online-cloud-storage)
- [Google Drive](https://www.google.com/drive/)
- [Dropbox](https://www.dropbox.com/basic)

I can nearly guarantee that at some point in your career in this industry you will find yourself resetting your machine, or being forced to reinstall your operating system, thanks to a catastrophic environment failure. A good admin knows that anything important on your computer should be backed up in two separate locations.