
Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

In [0]:
def create_train_and_test(df):
    
    # Convert the "created" strings to datetime objects (this modifies df in place)
    df["created"] = pd.to_datetime(df["created"])

    # Train on April & May listings, test on June listings
    df_train = df[ df["created"].dt.month_name().isin(["April", "May"]) ]
    df_test = df[ df["created"].dt.month_name().isin(["June"]) ]

    # Check that every row of df landed in either the train or the test set
    assert df_train.shape[0] + df_test.shape[0] == df.shape[0]

    X_train = df_train.drop("price", axis=1)
    y_train = df_train["price"]
    X_test = df_test.drop("price", axis=1)
    y_test = df_test["price"]

    return X_train, y_train, X_test, y_test


# I will do the splitting after creating the features


- [ ] Engineer at least two new features. (See below for explanation & ideas.)


In [0]:
# I will try to add all of the suggested features:


> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?


In [71]:
assert df["description"].isna().sum() > 0
# so rows with no description are probably stored as np.nan

df["has_description"] = (~df["description"].isna()).astype(int)
df["has_description"]

0        1
1        1
2        1
3        1
4        1
        ..
49347    1
49348    1
49349    1
49350    1
49351    1
Name: has_description, Length: 48817, dtype: int64

- How long is the description?


In [0]:
# Fill np.nan values with empty strings, then take the length of each description
df["description_length"] = df["description"].fillna('').str.len()

df["description_length"] = df["description_length"].astype(int)

- How many total perks does each apartment have?

In [0]:
# The perk flags are the contiguous block of columns from "elevator" to "common_outdoor_space"
perk_columns = df.columns.to_series().loc["elevator": "common_outdoor_space"]
df["perk_count"] = df[perk_columns].sum(axis=1)
# df["perk_count"]


- Are cats _or_ dogs allowed?


In [0]:
df["cats_or_dogs"] = df["cats_allowed"] | df["dogs_allowed"]

- Are cats _and_ dogs allowed?


In [0]:
df["cats_and_dogs"] = df["cats_allowed"] & df["dogs_allowed"]

- Total number of rooms (beds + baths)


In [0]:
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
# df["total_rooms"].value_counts(dropna=False)

- Ratio of beds to baths


In [0]:
# Treat 0 bathrooms as 1 so the division never produces inf,
# which makes a separate inf -> NaN replacement unnecessary
df["beds_to_baths_ratio"] = df["bedrooms"] / df["bathrooms"].replace(0, 1)
# df["beds_to_baths_ratio"].value_counts(dropna=False)

- What's the neighborhood, based on address or latitude & longitude?

In [0]:
# ?
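
One rough way to approximate a neighborhood from coordinates alone is to snap each listing to a grid cell. The sketch below is only an illustration, not part of the assignment: `grid_cell` is a hypothetical helper, and the `cell_size` of 0.01 degrees (roughly 1 km of latitude) is an arbitrary assumption. Real neighborhood boundaries do not follow a grid.

```python
import math

def grid_cell(latitude, longitude, cell_size=0.01):
    """Return a string label for the grid cell containing the point.

    Hypothetical helper: cell_size is in degrees and was picked arbitrarily.
    """
    # floor() keeps negative longitudes (NYC is around -74) in a consistent cell
    row = math.floor(latitude / cell_size)
    col = math.floor(longitude / cell_size)
    return f"{row}_{col}"

# Two nearby points share a cell; a point farther away gets a different one
a = grid_cell(40.7145, -73.9425)
b = grid_cell(40.7148, -73.9421)
c = grid_cell(40.7947, -73.9667)
print(a, b, c)
```

On the notebook's dataframe this could be applied as `df["grid_cell"] = [grid_cell(lat, lon) for lat, lon in zip(df["latitude"], df["longitude"])]`. The result is a categorical label, so it would need to be encoded (for example one-hot) before a linear regression can use it.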

- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
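
With scikit-learn, a fitted `LinearRegression` exposes the coefficients as `model.coef_` and the intercept as `model.intercept_`. To show what those numbers mean, the sketch below solves an ordinary least squares problem directly with NumPy on made-up data. The data and variable names are assumptions for illustration only, and NumPy's `lstsq` is swapped in here in place of scikit-learn.

```python
import numpy as np

# Made-up data: price = 1000 + 500 * bedrooms + 300 * bathrooms (no noise)
X = np.array([[1, 1], [2, 1], [2, 2], [3, 1], [3, 2], [4, 2]], dtype=float)
y = 1000 + 500 * X[:, 0] + 300 * X[:, 1]

# Prepend a column of ones so the first fitted weight acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, coefficients = weights[0], weights[1:]
print("intercept:", round(float(intercept), 2))
print("coefficients:", np.round(coefficients, 2))
```

Each coefficient is the predicted change in price for a one-unit increase in that feature with the other features held fixed, and the intercept is the prediction when every feature is zero.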

In [0]:
# Plan:
# 1. Compute each numeric feature's correlation with price.
# 2. Take the absolute value and sort the features by it.
# 3. Add the features to the regression one at a time, in that order,
#    and track how the train and test errors change.

In [80]:
# choosing only numeric columns
numerics_df = df.select_dtypes(np.number)

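price_corr = numerics_df.corr()["price"]
# Sort ascending by absolute correlation, then drop the last entry,
# which is price's correlation with itself (1.0)
sorted_abs_corrs_with_price = price_corr.abs().sort_values().iloc[:-1]
sorted_abs_corrs_with_price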

loft                    0.007100
has_description         0.008946
common_outdoor_space    0.011517
exclusive               0.013251
laundry_in_building     0.019417
pre-war                 0.029122
latitude                0.036286
cats_or_dogs            0.050989
cats_allowed            0.051453
dogs_allowed            0.060401
cats_and_dogs           0.060873
new_construction        0.071431
wheelchair_access       0.072517
high_speed_internet     0.090269
hardwood_floors         0.101503
garden_patio            0.103672
roof_deck               0.122929
no_fee                  0.132240
swimming_pool           0.134513
balcony                 0.139140
outdoor_space           0.142146
terrace                 0.145973
description_length      0.161948
elevator                0.207169
beds_to_baths_ratio     0.215605
dishwasher              0.223899
fitness_center          0.228775
dining_room             0.242911
longitude               0.251004
laundry_in_unit         0.271195
doorman   

In [0]:
X_train, y_train, X_test, y_test = create_train_and_test(df)

X_train = X_train.select_dtypes(np.number)
X_test = X_test.select_dtypes(np.number)

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def root_mean_squared_error(x1, x2):
    return np.sqrt(mean_squared_error(x1, x2))
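
For reference, the three metrics the assignment asks for can be written out by hand. The sketch below uses plain Python and made-up prices. The function names are hypothetical stand-ins for the scikit-learn functions imported above, not replacements for them.

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error: average size of the miss, in dollars
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error: like MAE, but large misses count for more
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r2(y_true, y_pred):
    # 1 - (model's squared error) / (squared error of always predicting the mean)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up example values
y_true = [2000, 3000, 4000, 5000]
y_pred = [2500, 3000, 3500, 5000]

print(mae(y_true, y_pred))   # 250.0
print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

Because $R^2$ compares the model against always predicting the mean of the targets being scored, it can come out negative on test data when the model does worse than that baseline.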

In [0]:
# X_train.isna().sum()

In [148]:
def score_calculator(X_train, y_train, X_test, y_test):
    """Fit a LinearRegression and return (MAE, MSE, RMSE, R^2) tuples for train and test."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    train_errors = (
        mean_absolute_error(y_train, y_pred_train),
        mean_squared_error(y_train, y_pred_train),
        root_mean_squared_error(y_train, y_pred_train),
        r2_score(y_train, y_pred_train)
    )

    y_pred_test = model.predict(X_test)
    test_errors = (
        mean_absolute_error(y_test, y_pred_test),
        mean_squared_error(y_test, y_pred_test),
        root_mean_squared_error(y_test, y_pred_test),
        r2_score(y_test, y_pred_test)
    )
    return train_errors, test_errors


def model_score_vs_number_of_features_graph(sorted_features):
    
    graph_col_names = [
        "train_mae",
        "train_mse",
        "train_rmse",
        "train_r2",
        "test_mae",
        "test_mse",
        "test_rmse",
        "test_r2"
    ]
    graph_df = pd.DataFrame(columns=graph_col_names)

    # range() excludes its stop value, so add 1 to also score the model that uses every feature
    for feature_count in range(1, len(sorted_features) + 1):
        features = sorted_features[0:feature_count]
        train_errors, test_errors = score_calculator(X_train[features], y_train, X_test[features], y_test)
        graph_df.loc[feature_count] = train_errors + test_errors
    
    return graph_df


graph_df = model_score_vs_number_of_features_graph(sorted_abs_corrs_with_price.index)

display(graph_df.head())

| feature_count | train_mae | train_mse | train_rmse | train_r2 | test_mae | test_mse | test_rmse | test_r2 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1201.890936 | 3105028.0 | 1762.108996 | 3.408185e-08 | 1197.712262 | 3108130.0 | 1762.988911 | -3.5e-05 |
| 2 | 1201.609904 | 3104545.0 | 1761.971951 | 0.0001555746 | 1197.442971 | 3108365.0 | 1763.055457 | -0.00011 |
| 3 | 1201.58468 | 3104408.0 | 1761.932975 | 0.000199808 | 1197.029497 | 3107631.0 | 1762.847538 | 0.000125 |
| 4 | 1201.283423 | 3103659.0 | 1761.720382 | 0.0004410628 | 1196.41958 | 3106454.0 | 1762.513668 | 0.000504 |
| 5 | 1200.451606 | 3100856.0 | 1760.924649 | 0.00134382 | 1196.213061 | 3102650.0 | 1761.434211 | 0.001728 |


In [0]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.layouts import row, column

In [0]:


def drawer(width, height, source, legend_pos):
    # (column suffix, display label) for each metric panel, in plotting order
    metrics = [("r2", "R squared"), ("mse", "MSE"), ("mae", "MAE"), ("rmse", "RMSE")]

    plots = []
    for suffix, label in metrics:
        # plot_width/plot_height are the pre-3.0 Bokeh names; Bokeh 3.x uses width/height
        p = figure(plot_width=width, plot_height=height, title=f"{label} VS #features",
                   x_axis_label="number of features", y_axis_label=label, toolbar_location=None)
        p.line("index", f"train_{suffix}", source=source, color="red", legend_label=f"train_{suffix}")
        p.line("index", f"test_{suffix}", source=source, color="blue", legend_label=f"test_{suffix}")
        p.legend.location = legend_pos
        plots.append(p)

    output_notebook()
    # show(column(row(*plots[:2]), row(*plots[2:])))
    show(row(*plots))



In [162]:
# Add features from least to most correlated with price
graph_df = model_score_vs_number_of_features_graph(sorted_abs_corrs_with_price.index)
source = ColumnDataSource(data=graph_df)
drawer(300, 300, source, "top_left")

In [170]:
# Same experiment in reverse: add features from most to least correlated with price
graph_df = model_score_vs_number_of_features_graph(sorted_abs_corrs_with_price.index[::-1])
source = ColumnDataSource(data=graph_df)
drawer(300, 300, source, "top_right")




#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 


## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!