__ML Pipeline - demonstrated in simple linear regression__

In [2]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import env

## Acquire Data
**Deliverable: wrangle.py**

### Retrieve and understand Data

In [None]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--", " "]
# Read the file once, treating the markers above as NaN
df = pd.read_csv("data/student_grades.csv", na_values=missing_values)

df.head()
df.shape
df.describe()
df.info()

print(df.isnull().sum()) # count nulls per column
print(df.columns[df.isnull().any()]) # columns that contain nulls
df.exam3.value_counts(sort=True, ascending=True)
df.replace(r'^\s*$', np.nan, regex=True, inplace=True) # replace empty strings with null
df = df.dropna().astype('int') # option 1: drop the rows with missing values
# option 2: fill missing values with a value instead of dropping the rows,
# e.g. df = df.fillna(value); fillna() needs a fill value or a method
# or use imputation to fill in a mean (or similar) instead of dropping any value or row
# ex: fill in mean from same row, model prediction to fill in predicted value

df.info()

In [None]:
# summarize all data
def peek_data(df: pd.DataFrame):
    print('- Shape')
    print(df.shape)
    print('- Head and Tail')
    print(pd.concat([df.head(), df.tail()]))
    print('- Numeric Vars')
    print(df.describe())
    print('- String Columns')
    for col in df.select_dtypes('object'):
        print('--- {}'.format(col))
        print(df[col].value_counts().head())

peek_data(df)

### Visualize Distribution
**Histograms &/or boxplots** 
see distribution, skewness, outliers, unit scales.

In [None]:
plt.figure(figsize=(16, 3))

for i, col in enumerate(['exam1', 'exam2', 'exam3', 'final_grade']):  
    plot_number = i + 1 # i starts at 0, but plot nos should start at 1
    series = df[col]  
    plt.subplot(1,4, plot_number)
    plt.title(col)
    series.hist(bins=5)

In [None]:
# seaborn.boxplot default to plot all numeric variables if we don't specify specific x and y values.
# specify the columns to be dismissed
plt.figure(figsize=(8,4))
sns.boxplot(data=df.drop(columns=['student_id']))

## Visualization
**Deliverable: explore.py**

In [None]:
# view relationship between 1 var & target
sns.jointplot(x="exam1", y="final_grade", data=train, kind='reg', height=5)

In [None]:
# seaborn.pairgrid + matplotlib.pyplot.hist + matplotlib.pyplot.scatter
# greater flexibility to customize the type of the plots in each position.

g = sns.PairGrid(train)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)

# heatmap to show correlation
plt.figure(figsize=(8,6))
sns.heatmap(train.corr(), cmap='Blues', annot=True)
plt.ylim(0, 4)

## Preprocessing: Split and Scale 
**Deliverable: split_scale.py**

- Create training set & testing set

- random sampling

- split by % (ex: train:75% & test: 25%)



- Create a scaled version of attributes and target so that we can compare the importance of each feature.

- Assure equal weight 

>e.g. age and weight

>weight would have more impact on a regression model than age purely because it is in larger units, if we didn't scale them

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import wrangle
import env

### Import cleaned data


In [None]:
# acquire data and remove null values 
df = wrangle.wrangle_grades()

# verify acquisition
df.info()
df.describe()

### Split data

In [None]:
from sklearn.model_selection import train_test_split
# Train Test Split
# set random seed so the randomization is reproducible
train, test = train_test_split(df, train_size = .80, random_state = 123)
print(train.shape); print(test.shape)

### Scale data: 
e.g. avoid particular attributes diluting out the importance of other attributes

use scatter plot to present before and after scaling

- **normalize the numeric range of the attributes**

- aka data normalization or standardization (if scaled to a mean of 0 and unit variance)

- performed between initial exploration and feature engineering.

- each attribute is scaled independently, e.g. the mean of exam 2 will not affect how exam 1 is scaled.

- thus, it is OKAY TO NOT SCALE ALL ATTRIBUTES

- helps to identify relationships such as correlations, while exploring

> create the scaler object: scaler
>
> fit: scaler.fit(train)
>
> train_scaled = scaler.transform(train) & test_scaled = scaler.transform(test)
>
> inverse the transformed data back to its original values scaler.inverse_transform(test)
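
The create / fit / transform / inverse steps above can be sketched without any library. `SimpleStandardScaler` is a hypothetical stand-in for a z-score scaler (it is not scikit-learn's class), and the grades are made up; the point is only that the mean and standard deviation are learned from train and then reused on test.

```python
from statistics import fmean, pstdev


class SimpleStandardScaler:
    """Hypothetical stand-in for a standard (z-score) scaler on one column."""

    def fit(self, values):
        # Learn the parameters from the training data only.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, scaled):
        return [s * self.std_ + self.mean_ for s in scaled]


train_exam1 = [60.0, 70.0, 80.0, 90.0, 100.0]  # made-up grades
test_exam1 = [65.0, 95.0]

scaler = SimpleStandardScaler().fit(train_exam1)
train_scaled = scaler.transform(train_exam1)
test_scaled = scaler.transform(test_exam1)        # reuses the training mean/std
test_restored = scaler.inverse_transform(test_scaled)

print(train_scaled)
print(test_scaled)
print(test_restored)
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, because it is scaled with the training parameters.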

**Method selection - val of outlier, (squeeze based on ratio?)**
- **StandardScaler** Standard Normal Distribution (mean=0, stdev=1), result = z-score
> linear transformer 
>
> For dev successful/ effective regression model, SVM, clustering algorithms 
>
> individual features need to resemble standard normally distributed data

- **QuantileTransformer (uniform)**

> **Distort correlations and distances within and across features**
>
> non-linear transformer 
>
> i.e. values are not the result of a linear function
>
> smooths out unusual distributions
>
> spreads out the most frequent values
>
> reduces the impact of (marginal) outliers 
>
> #sklearn exclusive, pandas is inclusive


- **PowerTransformer**

>**Gaussian Scaler** Scale to Gaussian-like distribution
>
> Use Box-Cox or Yeo-Johnson method to transform to resemble normal or standard normal distribution.
>
> default = zero-mean & unit-variance (standard normal).
>
>**Yeo-Johnson** supports both positive and negative data
>
> **Box-Cox** only supports strictly positive data

- **RobustScaler**
> **Handle outliers**
>
> mean and variance do not work well when outliers are present
>
> the median is removed (instead of the mean)
> 
> **data is scaled according to a quantile range (the IQR is default)**

- **MinMaxScaler**
> linear transformation, coz derived from a linear function
>
> sensitive to outlier
> 
>**scale to a range**, result between the given range


The values for mean and variance that were computed from the training data in .fit() are stored with the scaler object, so that it can be used when scaling new data.
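
To make the outlier notes in the scaler list concrete, here is a small library-free comparison of min-max scaling and robust (median / IQR) scaling. The two helper functions are illustrative stand-ins written for this note, not the scikit-learn implementations, and the data is made up.

```python
from statistics import median, quantiles

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up data; 100 is an outlier


def min_max_scale(data):
    """Scale linearly to the range [0, 1]."""
    lo, hi = min(data), max(data)
    return [(v - lo) / (hi - lo) for v in data]


def robust_scale(data):
    """Remove the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    med = median(data)
    return [(v - med) / (q3 - q1) for v in data]


min_max_scaled = min_max_scale(values)
robust_scaled = robust_scale(values)

print(min_max_scaled)  # the four inliers are squeezed close to 0
print(robust_scaled)   # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With min-max scaling the single outlier defines the range, so the other values lose almost all of their spread; with the median and IQR the inliers keep their spacing.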

In [None]:
from sklearn.preprocessing import StandardScaler, QuantileTransformer, PowerTransformer, RobustScaler, MinMaxScaler

**Model Workflow: make it - fit it - use it**
>lr = LinearRegression()
>
>lr.fit(x,y)
>
>lr.predict(x)

**Feature Engineering, Scaling a variable**
>ms = MinMaxScaler()
>
>ms.fit()
>
>ms.transform()

## Regression Models & Evaluation Techniques
foundation for evaluating the effectiveness of a regression model

### Residuals

- (observed value - the estimated value) 
- vertical distance from the original data point to the expected data point (x_observed = x_expected, y on regression line).

###  SUM OF SQUARED ERRORS (SSE OR RSS, RESIDUAL SUM OF SQUARES)

Target: outliers matter
How: SUM(Square each residual/error)

$SSE = \sum_{i = 1}^{n} (\hat{y}_i - y_i)^2$

residual plot should be very random - meaning there's no bias in prediction

MSE = SSE / n 

RMSE = sqrt(MSE) # in summary/ average, our prediction is this far away from actual values
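
A short worked example of the three metrics, using made-up observed and predicted values (not the student grades data):

```python
from math import sqrt

y_actual = [1.0, 2.0, 3.0, 4.0]      # made-up observed values
y_predicted = [1.5, 2.0, 2.5, 5.0]   # made-up model predictions

# Residual = observed value - estimated value
residuals = [y - y_hat for y, y_hat in zip(y_actual, y_predicted)]

sse = sum(r ** 2 for r in residuals)  # squaring makes large misses count more
mse = sse / len(residuals)
rmse = sqrt(mse)                      # back in the units of the target

print(residuals)       # [-0.5, 0.0, 0.5, -1.0]
print(sse, mse)        # 1.5 0.375
print(round(rmse, 4))  # 0.6124
```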


RSS (residual sum of squares) = SSE

ESS = explained sum of squares

SSE + ESS = TSS (total sum of squares)

compare with baseline ($mean$)

82% of my final grade is explained(model-able) by exam 1 (82% is good num)
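
The "percent explained" figure is R-squared, which can be read off the sums of squares above. A minimal sketch on made-up numbers, assuming a least-squares line with an intercept (the case in which SSE + ESS = TSS holds):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up feature values
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up target values
n = len(y)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares line fitted by hand: y_hat = intercept + slope * x
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean
y_hat = [intercept + slope * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained by the model
ess = sum((yh - y_mean) ** 2 for yh in y_hat)          # explained by the model
tss = sum((yi - y_mean) ** 2 for yi in y)              # error of the mean baseline

r_squared = ess / tss  # equivalently 1 - sse / tss

print(round(sse, 3), round(ess, 3), round(tss, 3))  # 2.4 3.6 6.0
print(round(r_squared, 3))                          # 0.6
```

Here the line explains 60% of the variance that the mean baseline leaves unexplained.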

p-value very close to 0.05, and CI very wide (need large net to catch all the val), we might re-think about including the particular feature in the model


In [None]:
# Prep
df.head()
df.describe()
df.info()

df.exam3.value_counts(ascending = False) # super helpful for spotting odd, non-numeric values
df.replace(r'^\s*$', np.nan, regex = True, inplace = True) # blank strings -> null
df = df.dropna().astype('int')
# outlier

# Distribution
# vis

plt.figure(figsize = (16, 3)) # same size as the histogram cell above
plt.hist(df.exam1, bins = 5)  # one column at a time; swap in the column of interest

sns.boxplot(data = df)

# split - scale
# split train & test (no need to separate X and y yet)
# after scaling, a scatter plot of before-scale vs after-scale values shows
# whether the scaler was linear or non-linear


In [None]:
from statsmodels.formula.api import ols
model = ols('final_grade ~ exam1', data=train).fit() # ols('y ~ x'): model y using x

## Feature Engineering

"Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data." Jason Brownlee, Machine Learning Mastery

- use domain knowledge > construct ad hoc features.
> ad hoc 
>
> signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes 


- Perform baseline evaluation to determine if you even need a feature.

- feature selection & sample handling


>**variable ranking method** Assess features individually 
>> dissect each feature & their impact on system
>>
>> filter out subset of features if too many or too messy/ noisy
>>
>> combine interdependent features?
>>
>> Scale features if not of similar proportion or units
>>
>> Decision about outliers, discard? ignore? replace?



>>**recursive feature elimination** 
>> recursively remove attributes to meet the number of required features 
>>
>> builds a model w/ remaining attributes
>>
>> evaluate model performance

>>**backward elimination** 
>> recursively remove the worst performing features one by one till the overall performance of the model comes in acceptable range.

>>**Forward selection**
>> begins w/ empty equation
>>
>> Study correlation: **predictors-independent var** vs **target-dependent**
>>
>> add var to model, begin with theoretically most correlated
>>
>> evaluate after addition of var

>>**Compare results from above methods**

- subsample data and re-run analysis 
> improve performance & understanding

- Split (train & test) - Scale - Split (X & Y)


- Filter
> keep highest correlated attributes
>
> drop if 1+ var highly correlated
>
> keep subset of relevant features
>
> **Pearson pairwise correlation test**
> 
> if the correlation between a pair of features is above a given threshold, you'd remove the one that has larger mean absolute correlation with other features
>
> **SelectKBest** - removes all but the highest scoring features
>> Score: test statistic for the chosen test (ex: chi-squared)

- Wrapper (iterative & computationally expensive process, more accurate)
> feed features to selected ML algorithm 
>
> add/ remove features based on model performance
>> Backward Elimination
>>
>> Forward Selection
>>
>> Bidirectional Elimination 
>>
>> Recursive Feature Elimination

- Embedded: With each iteration of model training, extract important features
> Regularization methods, most common
> penalize a feature given a coefficient threshold.
>> EX: Lasso regularization. 
```python
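# pseudocode: what the Lasso penalty does to an irrelevant feature
# if the feature is irrelevant (as determined by the given threshold):
#     Lasso penalizes it until feature_coefficient = 0
#     a zero coefficient effectively removes the feature from the model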
```
>> Other regularization algorithms: 
>>
>> Elastic Net
>>
>> Ridge Regression
>>
>> Regularized Regression.

- Linear Dimensionality Reduction, Principal component analysis (PCA)
**Unsupervised algorithm**

> Creates linear combinations of the original features
>
> new features = orthogonal = uncorrelated
>
> Rank method: use **explained variance**
>
> **PC1** explains the most variance in your dataset
>
> **PC2** explains the second-most variance
>
> GOAL: keep only min #(principal components) > reach cumulative explained variance of 90%
>
> $\because  $ transformation is dependent on scale
>
> $\therefore $ **ALWAYS NORMALIZE YOUR DATASET BEFORE PERFORMING PCA**
>
> Different PCA pkg, (e.g. kernel PCA, sparse PCA, etc.) - each solve unique obstacle
>
> The new principal components or features are not interpretable
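
A minimal NumPy sketch of the PCA steps listed above: standardize, take the eigenvectors of the covariance matrix, rank them by explained variance, and keep the smallest number of components that reaches 90%. The data is randomly generated for illustration, and this is a hand-rolled version rather than the `sklearn.decomposition.PCA` implementation.

```python
import numpy as np

rng = np.random.default_rng(123)

# Made-up data: three features, two of them strongly related.
exam1 = rng.normal(75, 10, size=100)
exam2 = exam1 + rng.normal(0, 3, size=100)
exam3 = rng.normal(70, 8, size=100)
X = np.column_stack([exam1, exam2, exam3])

# Standardize first, because the transformation depends on scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components are the eigenvectors of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

X_pca = X_std @ eigenvectors[:, :n_keep]
print(explained_variance_ratio, n_keep, X_pca.shape)
```

Each column of `X_pca` is a linear combination of all three standardized features, which is why the components are hard to interpret individually.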

**f-regression**

F statistic

The result of an ANOVA test or a regression analysis (comparing baseline & model); it indicates whether the means of two populations are significantly different.

Similar to a T statistic from a T-Test

A T-test will tell you if a single variable is statistically significant.
An F-test will tell you if a group of variables are jointly significant.
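
For a single feature the two statistics are directly related: the F statistic is the square of the slope's t statistic. A small worked example on made-up numbers, using the standard formula for simple linear regression with an intercept (n - 2 residual degrees of freedom):

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up feature values
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up target values
n = len(y)

x_mean = sum(x) / n
y_mean = sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_yy = sum((yi - y_mean) ** 2 for yi in y)

# Squared Pearson correlation between the feature and the target.
r_squared = s_xy ** 2 / (s_xx * s_yy)

# F statistic for one feature: explained variance over unexplained variance
# per residual degree of freedom (n - 2 for a slope plus an intercept).
f_statistic = r_squared / (1 - r_squared) * (n - 2)

# The t statistic of the slope is the signed square root of F.
t_statistic = (s_xy / abs(s_xy)) * f_statistic ** 0.5

print(round(r_squared, 3))    # 0.6
print(round(f_statistic, 3))  # 4.5
print(round(t_statistic, 3))  # 2.121
```

A larger F means the feature explains more of the target's variance relative to what is left over, which is the kind of score a univariate selector can rank features by.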

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import warnings
warnings.filterwarnings("ignore")

import env
import wrangle
import split_scale

In [None]:
# PRE-PROCESS, SPLIT, SCALE

# acquire data and remove null values 
df = wrangle.wrangle_grades()

# split into train and test
train, test = split_scale.split_my_data(df)

# scale data using standard scaler
scaler, train, test = split_scale.standard_scaler(train, test)

# to return to original values
# scaler, train, test = split_scale.my_inv_transform(scaler, train, test)

X_train = train.drop(columns='final_grade')
y_train = train[['final_grade']]
X_test = test.drop(columns='final_grade')
y_test = test[['final_grade']]

In [None]:
## FILTER ## 

In [None]:
# FILTER FEATURE 
#Using Pearson Correlation
plt.figure(figsize=(6,5))
cor = train.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
# View only the correlations of each attribute with the target variable, 
# filter down to only those above a certain value

#Correlation with output variable
cor_target = abs(cor["final_grade"])
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.5]
relevant_features

# output:
#
# exam1          0.984101
# exam2          0.922598
# exam3          0.950309
# final_grade    1.000000
# Name: final_grade, dtype: float64

# may decide to only use one exam (exam1 most correlated) 
# or 
# create new features, such as exam2_delta = exam2 - exam1
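
The pairwise rule from the Filter notes (when two features correlate above a threshold, drop the one with the larger mean absolute correlation with the other features) can be sketched without any library. The helper names `pearson` and `drop_correlated`, the 0.9 threshold, and the three columns are all made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    a_mean, b_mean = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_correlated(features, threshold=0.9):
    """Drop one feature from every pair whose |r| exceeds the threshold."""
    names = list(features)
    corr = {(a, b): abs(pearson(features[a], features[b]))
            for a in names for b in names if a != b}
    # Mean absolute correlation of each feature with all the others.
    mean_abs = {a: sum(corr[a, b] for b in names if b != a) / (len(names) - 1)
                for a in names}
    dropped = set()
    for a, b in combinations(names, 2):
        if a in dropped or b in dropped:
            continue
        if corr[a, b] > threshold:
            dropped.add(a if mean_abs[a] > mean_abs[b] else b)
    return [name for name in names if name not in dropped]


# Made-up columns: feature_b is almost a copy of feature_a.
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "feature_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(features)
print(kept)  # ['feature_b', 'feature_c']
```

`feature_a` is dropped because it is slightly more correlated with `feature_c` than `feature_b` is, so it carries more redundant information overall.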

In [None]:
# use test statistics to score & select feature
from sklearn.feature_selection import SelectKBest

# Chi-squared
# all values need to be >= 0 (chi2 rejects negative values)
# might need to un-scale or re-scale so no value is negative
from sklearn.feature_selection import chi2

scaler, train_unscaled, test_unscaled = split_scale.my_inv_transform(scaler, train_scaled=train, test_scaled=test)

X_train_unscaled = train_unscaled.drop(columns='final_grade')
y_train_unscaled = train_unscaled[['final_grade']]

chi_selector = SelectKBest(chi2, k=2)

chi_selector.fit(X_train_unscaled, y_train_unscaled)

chi_support = chi_selector.get_support()
chi_feature = X_train_unscaled.loc[:,chi_support].columns.tolist()

print(str(len(chi_feature)), 'selected features')
print(chi_feature)

# OUTPUT: Selected features, in a list
# 2 selected features
# ['exam1', 'exam2']

In [None]:
# Using f-regression
# F statistic 
# result from ANOVA test or regression analysis (compare baseline & model) 
# if the means between two populations are significantly different

# Similar to a T statistic from a T-Test

# A T-test will tell you if a single variable is statistically significant.
# An F-test will tell you if a group of variables are jointly significant.

from sklearn.feature_selection import f_regression

f_selector = SelectKBest(f_regression, k=2)
f_selector.fit(X_train,y_train)
f_support = f_selector.get_support()

f_feature = X_train.loc[:,f_support].columns.tolist()
print(str(len(f_feature)), 'selected features')
print(f_feature)

# OUTPUT: Selected features, in a list
# 2 selected features
# ['exam1', 'exam3']

In [None]:
## WRAPPER ##

**Backward Elimination using OLS**

- We check the performance of the model 

- iteratively remove the worst performing features one by one till the overall performance of the model comes in acceptable range.

- Performance metric/ measurement pre-determined
> p-value used for demo
```python
if p > 0.05: 
    remove feature
else:  
    keep feature
```    
run evaluation in a loop, return final set of features

In [None]:
import statsmodels.api as sm

# create the OLS object:
ols_model = sm.OLS(y_train, X_train)

# fit the model:
fit = ols_model.fit()

# summarize:
fit.summary()

In [None]:
cols = list(X_train.columns)
pmax = 1
while (len(cols)>0):
    p= []
    X_1 = X_train[cols]
    X_1 = sm.add_constant(X_1)
    model = sm.OLS(y_train,X_1).fit()
    p = pd.Series(model.pvalues.values[1:],index = cols)
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if(pmax>0.05):
        cols.remove(feature_with_p_max)
    else:
        break

selected_features_BE = cols
print(selected_features_BE) #output as list

**Recursive Feature Elimination (RFE)**

Recursively removes attributes and then builds a model on those attributes that remain.

It ranks the features according to their importance, taken from the fitted model (e.g. its coefficients or feature importances).

**INPUT:**
model to be used 
number of required features

**RETURN**
ranking of all the variables
- 1: most important
- True = relevant feature
- False = irrelevant feature

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

model = LinearRegression()

#Initializing RFE model, with parameter to select top 2 features. 
rfe = RFE(model, n_features_to_select=2)

#Transforming data using RFE
X_rfe = rfe.fit_transform(X_train,y_train)  

#Fitting the data to model
model.fit(X_rfe,y_train)

print(rfe.support_)
print(rfe.ranking_)

In [None]:
number_of_features_list=np.arange(1,3)
high_score=0

#Variable to store the optimum features
number_of_features=0           
score_list =[]

for n in range(len(number_of_features_list)):
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=number_of_features_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        number_of_features = number_of_features_list[n]

print("Optimum number of features: %d" %number_of_features)
print("Score with %d features: %f" % (number_of_features, high_score))

# output 
# Optimum number of features: 2
# Score with 2 features: 0.965926

In [None]:
cols = list(X_train.columns)
model = LinearRegression()

#Initializing RFE model
rfe = RFE(model, n_features_to_select=2)

#Transforming data using RFE
X_rfe = rfe.fit_transform(X_train,y_train)  

#Fitting the data to model
model.fit(X_rfe,y_train)
temp = pd.Series(rfe.support_,index = cols)
selected_features_rfe = temp[temp==True].index

print(selected_features_rfe)

# output
# Index(['exam1', 'exam3'], dtype='object')

In [None]:
## Embedded
from sklearn.linear_model import LassoCV

reg = LassoCV()
reg.fit(X_train, y_train)

print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X_train,y_train))
coef = pd.Series(reg.coef_, index = X_train.columns)


print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

# output
# Best alpha using built-in LassoCV: 0.001301
# Best score using built-in LassoCV: 0.973583
# Lasso picked 3 variables and eliminated the other 0 variables

# the higher the α,
# the fewer the features that keep non-zero coefficients
# aka Lasso eliminated more features

imp_coef = coef.sort_values()

import matplotlib

matplotlib.rcParams['figure.figsize'] = (4.0, 5.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")

In [None]:
## PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=1, copy=True, whiten=False, svd_solver='auto', random_state=123)
pca.fit(X_train)
X = pca.transform(X_train)
print(pca.n_components_)
print(len(X))
print(pca.explained_variance_ratio_)
print(X[0:5])

In [None]:
########## NOTE FOR EXERCISE CODE ############

# use test statistics to score & select feature
from sklearn.feature_selection import SelectKBest

# Chi-squared
# all values need to be >= 0 (chi2 rejects negative values)
# might need to un-scale or re-scale so no value is negative
from sklearn.feature_selection import chi2

scaler, train_unscaled, test_unscaled = split_scale.my_inv_transform(scaler, train_scaled=train, test_scaled=test)

X_train_unscaled = train_unscaled.drop(columns='final_grade')
y_train_unscaled = train_unscaled[['final_grade']]

chi_selector = SelectKBest(chi2, k=2)

chi_selector.fit(X_train_unscaled, y_train_unscaled)

chi_support = chi_selector.get_support()
chi_feature = X_train_unscaled.loc[:,chi_support].columns.tolist()

print(str(len(chi_feature)), 'selected features')
print(chi_feature)

# OUTPUT: Selected features, in a list

In [None]:
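def select_kbest_freg(X_train, y_train, k): # exercise 1: unscaled data; exercise 2: scaled data
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression

    f_selector = SelectKBest(f_regression, k=k).fit(X_train, np.ravel(y_train))
    # return a list of the top k features
    return X_train.loc[:, f_selector.get_support()].columns.tolist()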

# 2 selected features
# ['exam1', 'exam3']

In [None]:
########## NOTE FOR EXERCISE CODE ##############
# Using f-regression
# F statistic 
# result from ANOVA test or regression analysis (compare baseline & model) 
# if the means between two populations are significantly different

# Similar to a T statistic from a T-Test

# A T-test will tell you if a single variable is statistically significant.
# An F-test will tell you if a group of variables are jointly significant.

from sklearn.feature_selection import f_regression

f_selector = SelectKBest(f_regression, k=2)
f_selector.fit(X_train,y_train)
f_support = f_selector.get_support()

f_feature = X_train.loc[:,f_support].columns.tolist()
print(str(len(f_feature)), 'selected features')
print(f_feature)

# OUTPUT: Selected features, in a list

- heatmap, correlation
- construct new feature, if 2 var very similar
> NEW = df.exam1 - df.exam2 # constructed new, will be at different scale than original data
> thus definitely need scale

- split-scale

- corr.threshold
- selectkbest, based on a chosen statistic, ex: f-regression, which is univariate
```python
from sklearn.feature_selection import SelectKBest, f_regression
X_train = scaled.drop(columns = ['final_grade'])
y_train = scaled[["final_grade"]]

f_selector = SelectKBest(f_regression, k = 3).fit(X_train, y_train)
f_support = f_selector.get_support()

print(X_train.loc[:, f_support].columns.tolist()) # keeps the column order of the original data
f_selector.scores_ # show score for each var

cols = list(X_train.columns) # get list of column names

```

- backward, use OLS, take all < 0.05, identify, 
- RFE model + recursively remove

- embedded, make zero instead of drop



filter - computational limitation when dataset large
> only evaluate
> correlation & selectkbest

wrapper - performance of model to decide feature
>> **backward** 
>>
>> run model, check (ex: use p-value), remove the worst
>>
```python
if pmax > 0.05:
    cols.remove(feature_with_p_max)
```
>>
>> **recursive feature elimination**
>>
>> take the model and number of features provided
>>
>> return ranking of all variables and support (T/F)

```python
reg = LinearRegression()
rfe = RFE(reg, n_features_to_select=3)
X_rfe = rfe.fit_transform(X_train, y_train)
```

embedded - adapt, not discard variable

## Model

**Regression**
- supervised machine learning 
>> 
>> Relationship between one (univariate) or more (multivariate) features 
>>
>> how feature(s) contribute to 1 particular outcome - **continuous target variable**

1. Find an algorithm that takes in a set of parameters and returns a predicted data set
2. Identify the 'error function' to report difference between data & model prediction 
3. Find the parameters that minimize this difference.

regression: find **line or plane** that minimizes the errors in our predictions when compared to the labeled data

Univariate: $y = b_0 + b_1 x$

Polynomial: $y = b_0 + b_1 x + b_2 x^2$

Multivariate: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
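
The three steps listed above can be sketched for the univariate case. Gradient descent is used here only because it makes step 3 explicit; it is an illustrative choice, not how `LinearRegression` solves the problem, and the data, learning rate, and iteration count are made up.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up feature values
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # made-up labeled targets


def predict(b0, b1, xs):
    """Step 1: the algorithm maps parameters and inputs to predictions."""
    return [b0 + b1 * xi for xi in xs]


def mse_loss(b0, b1):
    """Step 2: the error function measures how far predictions are from the data."""
    return sum((yi - yh) ** 2 for yi, yh in zip(y, predict(b0, b1, x))) / len(y)


# Step 3: search for the parameters that minimize the error (gradient descent).
b0, b1 = 0.0, 0.0
learning_rate = 0.05
for _ in range(5000):
    errors = [yh - yi for yi, yh in zip(y, predict(b0, b1, x))]
    grad_b0 = 2 * sum(errors) / len(y)
    grad_b1 = 2 * sum(e * xi for e, xi in zip(errors, x)) / len(y)
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(round(b0, 3), round(b1, 3))   # 2.2 0.6
print(round(mse_loss(b0, b1), 3))   # 0.48
```

Nudging either parameter away from the fitted values increases the loss, which is what "minimizes the errors" means in practice.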

 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from math import sqrt
import warnings
warnings.filterwarnings("ignore")

import env
import wrangle
import split_scale
import features

# acquire data and remove null values 
df = wrangle.wrangle_grades()

# split into train and test
train, test = split_scale.split_my_data(df)

# scale data using standard scaler
# use std scaler to scale train & test, store into scaler
scaler, train, test = split_scale.standard_scaler(train, test)

# to return to original values
# scaler, train, test = split_scale.my_inv_transform(scaler, train, test)

# assign variable vs target
X_train = train.drop(columns='final_grade')
y_train = train[['final_grade']]
X_test = test.drop(columns='final_grade')
y_test = test[['final_grade']]

# Perform feature selection using RFE
number_of_features = features.optimal_number_of_features(X_train, y_train, 
                                        X_test, y_test)

selected_features = features.optimal_features(X_train, y_train, number_of_features)

X_train = X_train[selected_features]
X_test = X_test[selected_features]


In [None]:
from sklearn.linear_model import LinearRegression

lm1 = LinearRegression()

print(lm1)

# OUTPUT:
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


lm1.fit(X_train, y_train)
print("Linear Model:", lm1)

lm1_y_intercept = lm1.intercept_
print("intercept: ", lm1_y_intercept)

lm1_coefficients = lm1.coef_
print("coefficients: ", lm1_coefficients)

# Output:
# Linear Model: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
# intercept:  [2.62068042e-17]
# coefficients:  [[0.78604344 0.21040984]]

print('{} = b + m1 * {} + m2 * {}'.format(y_train.columns[0], X_train.columns[0],X_train.columns[1]))
print('    y-intercept  (b): %.2f' % lm1_y_intercept)
print('    coefficient (m1): %.2f' % lm1_coefficients[0][0])
print('    coefficient (m2): %.2f' % lm1_coefficients[0][1])

# Output:
# final_grade = b + m1 * exam1 + m2 * exam3
# y-intercept  (b): 0.00
# coefficient (m1): 0.79
# coefficient (m2): 0.21

## In sample prediction
y_pred_lm1 = lm1.predict(X_train)

## In sample evaluation
## use performance metrics: mean squared error & r-squared values

mse_lm1 = mean_squared_error(y_train, y_pred_lm1)
print("linear model\n  mean squared error: {:.3}".format(mse_lm1)) 

r2_lm1 = r2_score(y_train, y_pred_lm1)
print('  {:.2%} of the variance in the student''s final grade can be explained by the grades on exam 1 and 3.'.format(r2_lm1))

# Output of in sample evaluation
# linear model
# mean squared error: 0.0265
# 97.35% of the variance in the students final grade can be explained by the grades on exam 1 and 3.


# Establish baseline
from math import sqrt

y_pred_baseline = np.full(len(y_train), y_train['final_grade'].mean())
MSE = mean_squared_error(y_train, y_pred_baseline)
SSE = MSE*len(y_train)
RMSE = sqrt(MSE)

evs = explained_variance_score(y_train, y_pred_baseline)

print('sum of squared errors\n baseline: {:.5}'.format(SSE))
print('  {:.2%} of the variance in the student''s final grade can be explained by the baseline (mean).'.format(evs))

#output
# sum of squared errors
# baseline: 81.0
# 0.00% of the variance in the students final grade can be explained by the baseline (mean).




**Project Note**

- **predict values** of single unit properties (what's single unit? stand alone house?)

- used by tax district **last transaction: May, 2017 - June, 2017**

Cue

- **Property location** Identify thru **county, state**
> property tax calc at county level

- **Distribution of tax rate, county level**
> data: home tax amounts & tax value 
> 
> Deliverable: distribution of tax rate for each county -> tax vs physical mapping
>
> want to see how much tax vary within the properties in the county and 
> 
> the rates the bulk of the properties sit around. 
>
> Mapping is separate from Modeling, **tax is the target for modeling**

- 1st model: 
> taxvaluedollarcnt 
>
> estimate properties assessed value
>
> use sqft., #bed/ #bath
>
> may expand from here
>
> be careful to make sure using the best fields to represent sqft, #bed/ #bath
>
> best => the most accurate and available information. 
>
> You will need to do some data investigation in the database and use your domain expertise to make some judgement calls.







**Stakeholder** zillow data science team

state your goals as if you were delivering this to zillow. 
the goals as you understand them and as you have taken and acted upon through your research

**Deliverables** 

1. A report (presentation, both verbal & slides)
> summary of findings
>
> drivers of Zestimate error
>
> come from analysis from exploration phase of the pipeline 
>
> charts form - driver of errors

2. A github repository w/:
> jupyter notebook: walks through the pipeline  
>
> .py files: for model reproduction

Pipeline

- PROJECT PLANNING & README

- Brainstorming ideas, 
- hypotheses, 
- how variables might impact or relate to each other, 
> within independent variables
>
> between the independent variables and dependent variable, 
> 
> new features you may have while first looking at the existing variables 
>
> potential challenge

**README.md**
- description of project 
- instructions for reproducibility

**ACQUIRE**

**acquire.py** 
- Product: data acquired, along with overview of data
- Summary of dataframe 
> first few rows
>
> data types
>
> summary stats
>
> column names
>
> shape of the data frame
>
> etc

**PREP**

**prep.py**
- Product: tidy, ready to be analyzed data.
- Dictionary: data exploration, data tidy methods and logic.


**SPLIT & SCALE**

**split_scale.py**
- Product: 2 dataframes - split data (train & test), scaled data


**DATA EXPLORATION** 

**explore.py**
- Product: an analysis report of key takeaways & existing relationships among features 

Address each of the questions you posed in your planning and brainstorming 
any others you have come up with along the way through visual or statistical analysis.

product
the findings from your analysis that will be used in your final report, 
answers to specific questions your customer has asked, 
information to move forward toward building a model

Run at least 1 t-test and 1 correlation test (but as many as you need!)

Visualization: relationships variable-target, variable-variable (heatmap?)
> shed light into potential need to discard variable or construct new feature based on 2+ variable


**FEATURE SELECTION**

**feature_selection.py**

- Product: dataframe containing the features selected for model construction. 
- Dictionary of features and methods


**MODELING & EVALUATION**
Goal: develop a regression model that performs better than a baseline.

You must evaluate a baseline model, and show how the model you end up with performs better than that.

model.py: will have the functions to fit, predict and evaluate the model

Your notebook will contain various algorithms and/or hyperparameters tried, along with the evaluation code and results, before settling on the final algorithm.

Be sure and evaluate your model using the standard techniques: plotting the residuals, computing the evaluation metric (SSE, RMSE, and/or MSE), comparing to baseline, plotting $y$ by $\hat{y}$.

lecture note
pd.cut # create bin out of continuous data -> convert to discrete
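
What binning does can be shown without pandas. This sketch mimics right-closed intervals `(low, high]`, which I understand to be the `pd.cut` default (`right=True`); the bin edges, labels, and the helper `to_bin` are hypothetical.

```python
from bisect import bisect_left

bin_edges = [0, 60, 70, 80, 90, 100]  # hypothetical grade boundaries
labels = ["F", "D", "C", "B", "A"]     # one label per interval


def to_bin(value):
    """Return the label of the right-closed interval (low, high] holding value."""
    if not bin_edges[0] < value <= bin_edges[-1]:
        return None  # outside every interval
    return labels[bisect_left(bin_edges, value) - 1]


final_grades = [55, 60, 72, 88, 95]    # made-up continuous values
binned = [to_bin(g) for g in final_grades]
print(binned)  # ['F', 'F', 'C', 'B', 'A']
```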

if dataset big, have to filter, join etc in SQL before pull to local

a variable with more than 1 category (discrete) - need feature engineering

Y = b0 + b1X (1st degree polynomial)
b0: Y-intercept
b0, b1: parameters, required to find Y
polynomial for non-linear

>univariate (1 feature, 1 target) 
>
>multivariate (2+ features, 1 target, y = b0 + b1x1 + b2x2)
>
>poly (y = b0 + b1x + b2x^2)

error term -  from real data point to predicted line (for linear regression)

Goal, minimize sum sq error

data integrity, reproducibility
when imported into another file (ex: explore.ipynb or report.ipynb), can be used in different places
encapsulate the data acquisition sop


clone where my partner left off

git clone + paste the url from github 

essentially make a local copy of the files contributed from my partners

cp ../statistics-exercises/env.py ./ # copy from parent to here

git checkout file_name.py # go back in time...

after a day of work, create a readme documenting things to pay attention to, ex: remind partners they need to add their own env.py