<a href="https://colab.research.google.com/github/SarmenSinanian/DS-Unit-2-Applied-Modeling/blob/master/Sarmen_Sinanian_assignment_applied_modeling_ProjectV1_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Plot the distribution of your target. 
    - Regression problem: Is your target skewed? Then, log-transform it.
    - Classification: Are your classes imbalanced? Then, don't use just accuracy. And try the `class_weight` parameter in scikit-learn.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Share at least 1 visualization on Slack.

You need to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.


## Reading

### Today
- [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn)
- [Learning from Imbalanced Classes](https://www.svds.com/tbt-learning-imbalanced-classes/)
- [Machine Learning Meets Economics](http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/)
- [ROC curves and Area Under the Curve explained](https://www.dataschool.io/roc-curves-and-auc-explained/)
- [The philosophical argument for using ROC curves](https://lukeoakdenrayner.wordpress.com/2018/01/07/the-philosophical-argument-for-using-roc-curves/)


### Yesterday
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), **by Lambda DS3 student** Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)






In [0]:
# conda install -c conda-forge category_encoders
# conda update -n base -c defaults conda
# pip install --upgrade category_encoders

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# BELOW DATASET FROM https://query1.finance.yahoo.com/v7/finance/download/SPY?period1=728294400&period2=1566889200&interval=1d&events=history&crumb=ixT1ci5YI3E

In [3]:
# Setting specific columns to use (using unadjusted close and not
#  accounting for the splits and dividends)
columns = ['Date','Close','Volume']

# Calling data set
spy = pd.read_csv(r'E:\Desktop\Lambda_School\Assignments\Unit 2 Sprint 7 PROJECT\SPY.csv',
                  usecols = columns)

FileNotFoundError: ignored

In [0]:
# Checking columns
spy.head()

In [0]:
spy.describe()

In [0]:
spy.shape

In [0]:
spy.isna().sum()

In [0]:
# Changing Date to datetime format
spy['Date'] = pd.to_datetime(spy['Date'])

In [0]:
# Visualizing total dataset without volume

plt.plot_date(spy['Date'], spy['Close'])

# Choose your target. Which column in your tabular dataset will you predict?


In [0]:
# Target: is the next day's close above or below the previous day's close, based on the rolling mean (SMA) or the relative strength index (RSI)?

# Choose which observations you will use to train, validate, and test your model. And which observations, if any, to exclude.


In [0]:
#*WILL USE ALL SPY (S&P 500 ETF) DATA*

# Determine whether your problem is regression or classification.


In [0]:
# CLASSIFICATION: is this ticker over/under its X-day rolling mean AND ALSO overbought/oversold on the RSI?
# That gives a 3-way confusion matrix, with "under on both" expected to be the strongest predictor of positive returns over the next day/weeks/months.

# Choose your evaluation metric.

In [0]:
# WILL USE ACCURACY SCORE

# Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.


In [0]:
spy.head()

In [0]:
spy.dtypes

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
spy['Date'] = pd.to_datetime(spy['Date'])
spy['Year'] = spy['Date'].dt.year

In [0]:
spy.head()

In [0]:
spy.dtypes

### *`pd.rolling_mean` IS NO LONGER IN PANDAS, AND `.rolling(...)` NEEDS AN AGGREGATION SUCH AS `.mean()`*

In [0]:
# spy['SMA'] = spy['Close'].rolling(window = 14, min_periods = 14, axis = 0)

In [0]:
# spy['SMA'] = pd.rolling_mean(spy['Close'], min_periods = 14, window = 14)
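
The two commented attempts above fail for different reasons: `pd.rolling_mean` was removed from later pandas releases, and `.rolling(...)` on its own returns a window object, not a column of numbers. A minimal sketch of the working pattern, using a made-up price series in place of `spy['Close']` (the values and the 3-day window are assumptions chosen to keep the output small):

```python
import pandas as pd

# Hypothetical toy prices standing in for spy['Close'].
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# .rolling() only defines the window; an aggregation such as .mean()
# must follow it to get one number per row.
sma = close.rolling(window=3, min_periods=3).mean()

# The first window - 1 rows have no complete window, so they are NaN.
print(sma)
```

With a 14-day window the same logic leaves the first 13 rows empty, which is worth remembering before comparing `Close` with the moving average.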

In [0]:
spy.isnull().sum()

In [0]:
spy.head()

In [0]:
spy.head()



In [0]:
spy['SMA'] = spy.Close.rolling(window=14).mean()
spy['SMA_Yesterday'] = spy['SMA'].shift(1)

In [0]:
spy.head(15)

In [0]:
spy.tail()

In [0]:
spy.isna().sum()

In [0]:
spy.dtypes

In [0]:
# spy['Close_Higher'] = np.where(spy['Close'] > spy['Close'].shift(-1), 'True','False')

In [0]:
# spy_numeric = ['Close']

In [0]:
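# Note the encoding: 0 means yesterday's close was ABOVE yesterday's 14-day SMA,
# 1 means it was not (this includes the NaN warm-up rows).
spy['Above_14D_SMA_Yesterday'] = np.where(spy['SMA'].shift(1)<spy['Close'].shift(1), 0,1)

In [0]:
# Same convention: 0 means yesterday's close was BELOW yesterday's 14-day SMA,
# 1 means it was not.
spy['Below_14D_SMA_Yesterday'] = np.where(spy['SMA'].shift(1)>spy['Close'].shift(1), 0,1)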

In [0]:
# spy['Above_14D_SMA_Yesterday'] = np.where(spy['SMA']>spy['Close'], 'True','False')

In [0]:
spy.Above_14D_SMA_Yesterday.value_counts(normalize=True)

In [0]:
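# Label each day 1 if its close is at or above the previous day's close, else 0.
# The earlier np.insert(..., np.nan) approach wrote a truncated 'n' string into
# the first row of a string array, which created a spurious third class.
# The first row has no previous close; NaN >= 0 is False, so it is labeled 0.
spy['Close_Higher_Than_Yesterday'] = (spy['Close'].diff() >= 0).astype(int)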

In [0]:
spy.Close_Higher_Than_Yesterday.value_counts(normalize=True)

In [0]:
y_train = spy['Close_Higher_Than_Yesterday']

In [0]:
majority_class = y_train.mode()[0]

In [0]:
y_pred = [majority_class]*len(y_train)

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred)
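
The majority-class baseline above can also be expressed with scikit-learn's `DummyClassifier`, which keeps the baseline in the same fit/predict shape as the real model. This is a sketch on made-up labels, not the SPY target, so the accuracy shown is illustrative only:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = close at or above the previous close, 0 = below.
y = np.array([1, 1, 0, 1, 0, 1, 1, 0])
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(baseline_accuracy)  # 5 of 8 labels are 1, so 0.625
```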

# Begin to clean and explore your data.

In [0]:
spy_2019 = spy[spy['Year'] == 2019]
spy_2018 = spy[spy['Year'] == 2018]
spy_2017 = spy[spy['Year'] == 2017]
spy_2016 = spy[spy['Year'] == 2016]
spy_2015 = spy[spy['Year'] == 2015]
spy_2014 = spy[spy['Year'] == 2014]
spy_2013 = spy[spy['Year'] == 2013]
spy_2012 = spy[spy['Year'] == 2012]
spy_2011 = spy[spy['Year'] == 2011]
spy_2010 = spy[spy['Year'] == 2010]
spy_2009 = spy[spy['Year'] == 2009]
spy_2008 = spy[spy['Year'] == 2008]
spy_2007 = spy[spy['Year'] == 2007]
spy_2006 = spy[spy['Year'] == 2006]
spy_2005 = spy[spy['Year'] == 2005]
spy_2004 = spy[spy['Year'] == 2004]
spy_2003 = spy[spy['Year'] == 2003]
spy_2002 = spy[spy['Year'] == 2002]
spy_2001 = spy[spy['Year'] == 2001]
spy_2000 = spy[spy['Year'] == 2000]
spy_1999 = spy[spy['Year'] == 1999]
spy_1998 = spy[spy['Year'] == 1998]
spy_1997 = spy[spy['Year'] == 1997]
spy_1996 = spy[spy['Year'] == 1996]
spy_1995 = spy[spy['Year'] == 1995]
spy_1994 = spy[spy['Year'] == 1994]
spy_1993 = spy[spy['Year'] == 1993]
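
The 27 per-year frames above could also be held in a single dictionary keyed by year. The sketch below assumes a frame with a `Date` column like `spy`; the rows and prices are invented for illustration, and `by_year` is a hypothetical name:

```python
import pandas as pd

# Hypothetical rows standing in for the SPY data.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['1993-02-01', '1993-02-02',
                            '1994-01-03', '2019-08-26']),
    'Close': [43.9, 44.3, 46.5, 288.0],  # made-up prices
})
toy['Year'] = toy['Date'].dt.year

# One sub-frame per calendar year, looked up as by_year[1993] etc.
by_year = {year: frame for year, frame in toy.groupby('Year')}

print(sorted(by_year))
print(len(by_year[1993]))
```

A dictionary also avoids having to rebuild every yearly variable by hand whenever new feature columns are added to the parent frame.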

# Plot the distribution of your target.
### Regression problem: Is your target skewed? Then, log-transform it.
### Classification: Are your classes imbalanced? Then, don't use just accuracy. And try the `class_weight` parameter in scikit-learn.
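
A rough sketch of that check, on synthetic data and not the SPY target: look at the class shares first, then pass `class_weight='balanced'` (scikit-learn's name for the parameter) and score with a metric that is less flattering to the majority class. The feature, threshold, and sample size here are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# Synthetic, deliberately imbalanced data (roughly one positive in five).
rng = np.random.RandomState(0)
X = pd.DataFrame({'feature': rng.normal(size=200)})
y = (X['feature'] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)

# Step 1: plot or print the class distribution.
class_share = y.value_counts(normalize=True)
print(class_share)

# Step 2: reweight classes inversely to their frequency.
model = RandomForestClassifier(
    n_estimators=50, class_weight='balanced', random_state=42
)
model.fit(X, y)

# Scored on the training data only to keep the sketch short;
# a held-out set is what matters in practice.
score = balanced_accuracy_score(y, model.predict(X))
print(score)
```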


 

In [0]:
spy.tail(25)

In [0]:
spy_1994_2013 = pd.concat([spy_1994,spy_1995,spy_1996,spy_1997,spy_1998,
                           spy_1999,spy_2000,spy_2001,spy_2002,spy_2003,
                           spy_2004,spy_2005,spy_2006,spy_2007,spy_2008,
                           spy_2009,spy_2010,spy_2011,spy_2012,spy_2013])
spy_1994_2013.describe()

In [0]:
# A plain string makes y a 1-D Series, which is the shape scikit-learn expects
target = 'Close_Higher_Than_Yesterday'
drop = ['Date','Year']


train = spy_1994_2013.drop(columns=drop)
test = spy_2017.drop(columns=drop)
val = spy_2019.drop(columns=drop)

X_val = val.drop(columns=target)
y_val = val[target]

X_test = test.drop(columns=target)
y_test = test[target]

X_train = train.drop(columns=target)
y_train = train[target]
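
For time-series data the split is usually kept chronological, so that validation and test rows come after the training rows. The helper below is a hypothetical sketch (`split_by_year` is not part of this notebook, and the years passed in are only an example):

```python
import pandas as pd

def split_by_year(df, train_years, val_year, test_year, year_col='Year'):
    """Return (train, val, test) frames selected by calendar year."""
    first, last = train_years
    train = df[df[year_col].between(first, last)]  # inclusive on both ends
    val = df[df[year_col] == val_year]
    test = df[df[year_col] == test_year]
    return train, val, test

# Hypothetical frame with one row per year, standing in for spy.
toy = pd.DataFrame({'Year': list(range(1993, 2020)), 'Close': range(27)})

train, val, test = split_by_year(toy, (1994, 2013), val_year=2018, test_year=2019)
print(len(train), val['Year'].tolist(), test['Year'].tolist())
```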

In [0]:
X_test.head()

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [0]:
pipeline = make_pipeline(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Random Forest Validation Accuracy', pipeline.score(X_val, y_val))

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

y_pred = pipeline.predict(X_val)

confusion_matrix(y_val, y_pred)

In [0]:
def plot_confusion_matrix(y_true, y_pred):
    labels = unique_labels(y_true)
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]
    return columns, index

plot_confusion_matrix(y_val, y_pred)

In [0]:
def plot_confusion_matrix(y_true, y_pred):
    labels = unique_labels(y_true)
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]
    table = pd.DataFrame(confusion_matrix(y_true, y_pred),
                         columns=columns, index=index)
    return table

plot_confusion_matrix(y_val, y_pred)

In [0]:
import seaborn as sns

def plot_confusion_matrix(y_true, y_pred):
    labels = unique_labels(y_true)
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]
    table = pd.DataFrame(confusion_matrix(y_true, y_pred),
                         columns=columns, index=index)
    return sns.heatmap(table, annot=True, fmt='d', cmap='viridis')

plot_confusion_matrix(y_val, y_pred);

In [0]:
from sklearn.metrics import classification_report
print(classification_report(y_val, y_pred))
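
The precision and recall figures in the classification report come straight from the confusion matrix cells. A small worked sketch with made-up labels (not the notebook's predictions) shows the arithmetic:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# For binary 0/1 labels, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)                  # of predicted 1s, how many were right
recall = tp / (tp + fn)                     # of actual 1s, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, accuracy)  # 0.8 0.8 0.75
```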

# Continue to clean and explore your data. Make exploratory visualizations.

In [0]:
import pandas as pd

Close = spy['Close']

# Get the difference in price from previous step

delta = Close.diff()

# Get rid of the first row, which is NaN since it did not have a previous 
# row to calculate the differences
delta = delta[1:] 

# Make the positive gains (up) and negative gains (down) Series
up, down = delta.copy(), delta.copy()
up[up < 0] = 0
down[down > 0] = 0

# Lag gains and losses by one day so each row only sees past data
spy['Roll_Up'] = up.shift(1)
spy['Roll_Down'] = down.abs().shift(1)

# Calculate the EWMA of the lagged gains and losses
spy['Roll_Up1'] = spy['Roll_Up'].ewm(com=7).mean()
spy['Roll_Down1'] = spy['Roll_Down'].ewm(com=7).mean()

# Calculate the RSI based on EWMA

RS1 = spy['Roll_Up1'] / spy['Roll_Down1']
RSI1 = 100.0 - (100.0 / (1.0 + RS1))

spy['RSI_Yesterday_EXP'] = RSI1

# Calculate the SMA
spy['Roll_Up2'] = spy['Roll_Up'].rolling(window = 7).mean()
spy['Roll_Down2'] = spy['Roll_Down'].rolling(window = 7).mean()

# Calculate the RSI based on SMA
RS2 = spy['Roll_Up2'] / spy['Roll_Down2']
RSI2 = 100.0 - (100.0 / (1.0 + RS2))

spy['RSI_Yesterday_SMA'] = RSI2

# Compare graphically
plt.figure()
RSI1.plot()
RSI2.plot()
plt.legend(['RSI via EWMA', 'RSI via SMA'])
plt.show()
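
The SMA variant of the RSI calculation above can be condensed into one function. This is a sketch under assumptions: `rsi_sma` is a hypothetical helper, the prices are invented, and it leaves out the one-day lag that the notebook applies through `shift(1)`.

```python
import pandas as pd

def rsi_sma(close, window=7):
    """RSI from simple moving averages of gains and losses."""
    delta = close.diff()
    gains = delta.clip(lower=0)       # keep rises, zero out falls
    losses = -delta.clip(upper=0)     # keep falls as positive numbers
    avg_gain = gains.rolling(window).mean()
    avg_loss = losses.rolling(window).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

# Made-up prices that alternate up and down.
close = pd.Series([10, 11, 10, 12, 11, 13, 12, 14, 13, 15], dtype=float)
rsi = rsi_sma(close, window=4)

# Row 4: average gain 0.75, average loss 0.5, RS 1.5, RSI 60.
print(rsi.iloc[4])
```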

In [0]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [0]:
spy.head(20)

In [0]:
spy.tail()

In [0]:
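# Note: RSI_Yesterday_EXP and RSI_Yesterday_SMA are already lagged by one day,
# so the extra shift(1) makes these flags describe the session two days back.
spy['Overbought_Yesterday_EXP'] = spy['RSI_Yesterday_EXP'].shift(1) > 70.0
spy['Oversold_Yesterday_EXP'] = spy['RSI_Yesterday_EXP'].shift(1) < 30.0

spy['Overbought_Yesterday_SMA'] = spy['RSI_Yesterday_SMA'].shift(1) > 70.0
spy['Oversold_Yesterday_SMA'] = spy['RSI_Yesterday_SMA'].shift(1) < 30.0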

In [0]:
# Convert the boolean flags to 0/1 integers (an in-place replace on a
# selected column is unreliable in recent pandas versions)
spy['Overbought_Yesterday_EXP'] = spy['Overbought_Yesterday_EXP'].astype(int)
spy['Oversold_Yesterday_EXP'] = spy['Oversold_Yesterday_EXP'].astype(int)

spy['Overbought_Yesterday_SMA'] = spy['Overbought_Yesterday_SMA'].astype(int)
spy['Oversold_Yesterday_SMA'] = spy['Oversold_Yesterday_SMA'].astype(int)

In [0]:
spy.head()

In [0]:
# Below_14D_SMA_Yesterday is encoded as 0 when the close was under the SMA,
# so `== 0` selects the "under the SMA" rows.
spy['Oversold_EXP_And_Under_14D_SMA_Yesterday'] = ((spy['Oversold_Yesterday_EXP'] ==1) & (spy['Below_14D_SMA_Yesterday'] == 0))
spy['Oversold_SMA_And_Under_14D_SMA_Yesterday'] = ((spy['Oversold_Yesterday_SMA'] ==1) & (spy['Below_14D_SMA_Yesterday'] == 0))

In [0]:
# Convert the combined boolean flags to 0/1 integers
spy['Oversold_EXP_And_Under_14D_SMA_Yesterday'] = spy['Oversold_EXP_And_Under_14D_SMA_Yesterday'].astype(int)
spy['Oversold_SMA_And_Under_14D_SMA_Yesterday'] = spy['Oversold_SMA_And_Under_14D_SMA_Yesterday'].astype(int)

In [0]:
spy.head()

In [0]:
spy.Oversold_EXP_And_Under_14D_SMA_Yesterday.value_counts()

In [0]:
spy.Oversold_SMA_And_Under_14D_SMA_Yesterday.value_counts()

In [0]:
spy['Volume_Yesterday'] = spy['Volume'].shift(1)

In [0]:
spy_2019 = spy[spy['Year'] == 2019]
spy_2018 = spy[spy['Year'] == 2018]
spy_2017 = spy[spy['Year'] == 2017]
spy_2016 = spy[spy['Year'] == 2016]
spy_2015 = spy[spy['Year'] == 2015]
spy_2014 = spy[spy['Year'] == 2014]
spy_2013 = spy[spy['Year'] == 2013]
spy_2012 = spy[spy['Year'] == 2012]
spy_2011 = spy[spy['Year'] == 2011]
spy_2010 = spy[spy['Year'] == 2010]
spy_2009 = spy[spy['Year'] == 2009]
spy_2008 = spy[spy['Year'] == 2008]
spy_2007 = spy[spy['Year'] == 2007]
spy_2006 = spy[spy['Year'] == 2006]
spy_2005 = spy[spy['Year'] == 2005]
spy_2004 = spy[spy['Year'] == 2004]
spy_2003 = spy[spy['Year'] == 2003]
spy_2002 = spy[spy['Year'] == 2002]
spy_2001 = spy[spy['Year'] == 2001]
spy_2000 = spy[spy['Year'] == 2000]
spy_1999 = spy[spy['Year'] == 1999]
spy_1998 = spy[spy['Year'] == 1998]
spy_1997 = spy[spy['Year'] == 1997]
spy_1996 = spy[spy['Year'] == 1996]
spy_1995 = spy[spy['Year'] == 1995]
spy_1994 = spy[spy['Year'] == 1994]
spy_1993 = spy[spy['Year'] == 1993]

In [0]:
spy_1994_2013 = spy[(spy['Year'] >= 1994) & (spy['Year'] <=2013)]
spy_1994_2017 = spy[(spy['Year'] >= 1994) & (spy['Year'] <=2017)]
spy_2014_2019 = spy[(spy['Year'] >=2014) & (spy['Year'] <=2019)]

In [0]:
spy_2010_2013 = pd.concat([spy_2010,spy_2011,spy_2012,spy_2013])
spy_2010_2013.head()

In [0]:
spy_2010_2013.columns

In [0]:
# 2018 training data

# 0.0091 ± 0.0183	SMA_Yesterday
# 0.0061 ± 0.0000	Oversold_SMA_And_Under_14D_SMA_Yesterday
# 0.0061 ± 0.0000	Oversold_Yesterday_SMA
# 0 ± 0.0000	Oversold_EXP_And_Under_14D_SMA_Yesterday
# 0 ± 0.0000	Oversold_Yesterday_EXP
# -0.0030 ± 0.0183	Overbought_Yesterday_SMA
# -0.0061 ± 0.0000	Above_14D_SMA_Yesterday
# -0.0091 ± 0.0183	Overbought_Yesterday_EXP
# -0.0091 ± 0.0427	RSI_Yesterday_SMA
# -0.0091 ± 0.0305	Below_14D_SMA_Yesterday
# -0.0122 ± 0.0488	Volume_Yesterday
# -0.0213 ± 0.0427	RSI_Yesterday_EXP

In [0]:
# 1994-2013 training data

# 0 ± 0.0000	Oversold_Yesterday_EXP
# 0 ± 0.0000	SMA_Yesterday
# 0 ± 0.0000	Year
# -0.0040 ± 0.0000	Oversold_EXP_And_Under_14D_SMA_Yesterday
# -0.0040 ± 0.0000	Overbought_Yesterday_EXP
# -0.0060 ± 0.0120	Oversold_SMA_And_Under_14D_SMA_Yesterday
# -0.0159 ± 0.0159	Overbought_Yesterday_SMA
# -0.0199 ± 0.0000	RSI_Yesterday_EXP
# -0.0219 ± 0.0120	Volume_Yesterday

In [0]:
# BELOW IS THE DROP FOR TRAIN 1994-2013 AND VAL 2019

# 0.0823 ± 0.0427	Overbought_Yesterday_SMA
# 0.0671 ± 0.0366	RSI_Yesterday_EXP
# 0.0549 ± 0.0122	RSI_Yesterday_SMA
# 0.0335 ± 0.0061	Above_14D_SMA_Yesterday
# 0.0305 ± 0.0000	Below_14D_SMA_Yesterday
# 0.0061 ± 0.0122	Overbought_Yesterday_EXP
# 0 ± 0.0000	Volume_Yesterday
# 0 ± 0.0000	Oversold_SMA_And_Under_14D_SMA_Yesterday
# 0 ± 0.0000	Oversold_EXP_And_Under_14D_SMA_Yesterday
# 0 ± 0.0000	Oversold_Yesterday_EXP
# 0 ± 0.0000	SMA_Yesterday
# -0.0030 ± 0.0061	Oversold_Yesterday_SMA

# DROP BELOW AFTER TRAINING/VAL PERMUTATION IMPORTANCE

# 0.0030 ± 0.0305	Overbought_Yesterday_SMA
# 0.0030 ± 0.0061	Overbought_Yesterday_EXP
# -0.0030 ± 0.0061	Below_14D_SMA_Yesterday
# -0.0457 ± 0.0793	RSI_Yesterday_SMA

In [0]:
# 0.0183 ± 0.0122	Overbought_Yesterday_SMAx
# 0.0091 ± 0.0061	Below_14D_SMA_Yesterday
# 0.0061 ± 0.0000	Oversold_SMA_And_Under_14D_SMA_Yesterday
# 0.0061 ± 0.0122	Overbought_Yesterday_EXP
# 0 ± 0.0000	Oversold_EXP_And_Under_14D_SMA_Yesterday
# 0 ± 0.0000	SMA_Yesterday
# 0 ± 0.0000	Year
# -0.0061 ± 0.0000	Oversold_Yesterday_SMA
# -0.0061 ± 0.0122	Above_14D_SMA_Yesterday
# -0.0091 ± 0.0061	Oversold_Yesterday_EXP
# -0.0091 ± 0.0427	RSI_Yesterday_EXP
# -0.0122 ± 0.0000	Volume_Yesterday
# -0.0274 ± 0.0305	RSI_Yesterday_SMAx

In [0]:
target = 'Close_Higher_Than_Yesterday'
# drop = ['Date','Year','SMA','Volume','Adj_Close','Roll_Up','Roll_Down','Oversold_And_Under_14D_SMA_Yesterday']
# drop = ['Date','Year','SMA','Volume','Adj_Close','Roll_Up','Roll_Down', 'Above_14D_SMA_Yesterday',
#         'Overbought_Yesterday','RSI']

# drop = ['Date','Volume','SMA','Roll_Up','Roll_Up1','Roll_Up2','Roll_Down','Roll_Down1',
#         'Close','Roll_Down2','RSI_Yesterday_SMA','Oversold_Yesterday_EXP','SMA_Yesterday',
#         'Year','Oversold_EXP_And_Under_14D_SMA_Yesterday','Overbought_Yesterday_EXP',
#         'Oversold_SMA_And_Under_14D_SMA_Yesterday','Overbought_Yesterday_SMA','RSI_Yesterday_EXP',
#         'Volume_Yesterday']


# drop = ['Date','Volume','SMA','Roll_Up','Roll_Up1','Roll_Up2','Roll_Down','Roll_Down1',
#         'Close','Roll_Down2','RSI_Yesterday_EXP','Volume_Yesterday','Below_14D_SMA_Yesterday',
#         'Year','RSI_Yesterday_SMA','Overbought_Yesterday_EXP','Above_14D_SMA_Yesterday',
#         'Overbought_Yesterday_SMA'
#         ]

# BELOW ARE STANDARD DROPS (SOME CONTAIN FUTURE LEAKAGE)

# drop = ['Date','Volume','SMA','Roll_Up','Roll_Up1','Roll_Up2','Roll_Down','Roll_Down1',
#         'Close','Roll_Down2']

# BELOW IS THE DROP FOR TRAIN 1994-2013 AND VAL 2019

#1
# drop = ['Date','Volume','SMA','Roll_Up','Roll_Up1','Roll_Up2','Roll_Down','Roll_Down1',
#         'Close','Roll_Down2','Year','Volume_Yesterday','Oversold_SMA_And_Under_14D_SMA_Yesterday',
#         'Oversold_EXP_And_Under_14D_SMA_Yesterday','Oversold_Yesterday_EXP','SMA_Yesterday',
#         'Oversold_Yesterday_SMA','Overbought_Yesterday_SMA','Overbought_Yesterday_EXP',
#         'Below_14D_SMA_Yesterday','RSI_Yesterday_SMA']

#2
# drop = ['Date','Volume','SMA','Roll_Up','Roll_Up1','Roll_Up2','Roll_Down','Roll_Down1',
#         'Close','Roll_Down2','Overbought_Yesterday_EXP','Oversold_Yesterday_SMA',
#         'Year','SMA_Yesterday','RSI_Yesterday_SMA','Oversold_EXP_And_Under_14D_SMA_Yesterday',
#         'Oversold_SMA_And_Under_14D_SMA_Yesterday','Oversold_Yesterday_EXP','RSI_Yesterday_EXP',
#         'Above_14D_SMA_Yesterday']

#3
drop = ['Date','Volume','SMA','Roll_Up','Roll_Up1','Roll_Up2','Roll_Down','Roll_Down1',
        'Close','Roll_Down2','RSI_Yesterday_SMA','Overbought_Yesterday_SMA','Volume_Yesterday',
        'Oversold_EXP_And_Under_14D_SMA_Yesterday','Year','SMA_Yesterday','Below_14D_SMA_Yesterday',
        'Oversold_SMA_And_Under_14D_SMA_Yesterday','Overbought_Yesterday_EXP','Oversold_Yesterday_EXP',
        'Oversold_Yesterday_SMA','Above_14D_SMA_Yesterday']

train = spy_1994_2013.drop(columns=drop)
test = spy_2015.drop(columns=drop)
val = spy_2019.drop(columns=drop)

X_val = val.drop(columns=target)
y_val = val[target]

X_test = test.drop(columns=target)
y_test = test[target]

X_train = train.drop(columns=target)
y_train = train[target]

In [0]:
y_val.value_counts()

In [0]:
y_train.value_counts()

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

In [0]:
pipeline = make_pipeline(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

In [0]:
y_val.describe()

In [0]:
y_pred=pipeline.predict(X_val)

plot_confusion_matrix(y_val,y_pred)

In [0]:
print(classification_report(y_val, y_pred))

# Fit a model. Does it beat your baseline?

### ROC AUC (CONSTANT PREDICTIONS ALWAYS GIVE 0.5)

In [0]:
# from sklearn.metrics import roc_auc_score

# y_pred_proba = np.full_like(y_val, fill_value=1.00)
# roc_auc_score(y_val, y_pred_proba)

# y_pred_proba = np.full_like(y_val, fill_value=0)
# roc_auc_score(y_val, y_pred_proba)

# y_pred_proba = np.full_like(y_val, fill_value=0.50)
# roc_auc_score(y_val, y_pred_proba)

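# THE NEXT CELL RAISES A WARNING, BECAUSE NO ROW OF y_val EQUALS 'Charged Off'
# (A LABEL THAT DOES NOT OCCUR IN THIS TARGET), SO y_true HAS NO POSITIVES:
# UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless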

In [0]:
# import matplotlib.pyplot as plt
# from sklearn.metrics import roc_curve
# fpr, tpr, thresholds = roc_curve(y_val=='Charged Off', y_pred_proba)
# plt.plot(fpr, tpr)
# plt.title('ROC curve')
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate');
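
A constant score cannot rank any positive above any negative, so its ROC AUC is 0.5 by construction; that is why the commented cells above return "generic" values. A more informative AUC needs per-row scores, for example from `predict_proba`. The sketch below uses synthetic data and a logistic regression purely as an assumption to show the contrast:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: one feature that is noisily related to the 0/1 label.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 1))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# The same score for every row carries no ranking information.
constant_scores = np.full(len(y), 0.5)
auc_constant = roc_auc_score(y, constant_scores)

# Scores that vary by row: predicted probability of the positive class.
model = LogisticRegression().fit(X, y)
proba_positive = model.predict_proba(X)[:, 1]
auc_model = roc_auc_score(y, proba_positive)

print(auc_constant, auc_model)
```

For the notebook's own model, the equivalent scores would come from `pipeline.predict_proba(X_val)[:, 1]`, with the positive class being whichever label marks a higher close.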

In [0]:
y_val.value_counts()

In [0]:
import category_encoders as ce
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

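# Note: LinearRegression returns an unbounded score for this 0/1 target,
# not a probability; LogisticRegression is the usual classifier for it.
lr = make_pipeline(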
    ce.OrdinalEncoder(), # Not ideal for Linear Regression 
    StandardScaler(), 
    LinearRegression()
)

lr.fit(X_train, y_train)
print('Linear Regression R^2', lr.score(X_val, y_val))
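
A linear regression fitted to a 0/1 target can return values below 0 or above 1, so its output is better read as a score than as a probability. The sketch below contrasts it with `LogisticRegression.predict_proba` on synthetic data; the RSI-like feature, the label rule, and the deliberately out-of-range inputs are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic RSI-like feature; higher values make label 1 somewhat more likely.
rng = np.random.RandomState(1)
rsi = rng.uniform(0, 100, size=(400, 1))
y = (rng.uniform(size=400) < 0.3 + 0.004 * rsi[:, 0]).astype(int)

linear = LinearRegression().fit(rsi, y)
logistic = LogisticRegression(max_iter=1000).fit(rsi, y)

# Deliberately extreme inputs to expose the difference.
extreme = np.array([[-500.0], [600.0]])
linear_scores = linear.predict(extreme)                  # can leave [0, 1]
logistic_proba = logistic.predict_proba(extreme)[:, 1]   # always within [0, 1]

print(linear_scores, logistic_proba)
```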

In [0]:
X_val.columns

In [0]:
X_val.RSI_Yesterday_EXP.value_counts()

In [0]:
X_val_example = X_val[X_val['RSI_Yesterday_EXP'] <= 30]
X_val_example

In [0]:
example = X_val_example.iloc[[0]]
example

In [0]:
pred = lr.predict(example)[0]
print(f'Predicted Score, Close Higher Today: {pred:.2f}')

In [0]:
example2 = X_val_example.iloc[[1]]
pred2 = lr.predict(example2)[0]
print(f'Predicted Score, Close Higher Today: {pred2:.2f}')

In [0]:
example2

In [0]:
example3 = X_val.iloc[[4]]
pred3 = lr.predict(example3)[0]
print(f'Predicted Score, Close Higher Today: {pred3:.2f}')

In [0]:
example3

In [0]:
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 72

In [0]:
# conda install -c conda-forge category_encoders
# pip install category_encoders
# pip install plotly==4.1.0
# conda install -c conda-forge eli5 

import eli5
from eli5.sklearn import PermutationImportance

model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

permuter = PermutationImportance(
    model, scoring='accuracy', n_iter=2, random_state=42
)

permuter.fit(X_val, y_val)
feature_names = X_val.columns.tolist()
eli5.show_weights(
    permuter,
    top=None,
    feature_names = feature_names
)
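
If `eli5` is unavailable in a given environment, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`, which measures the same thing: the drop in validation score when one column is shuffled. The sketch below swaps it in as an alternative to the cell above; the `signal` and `noise` columns are synthetic stand-ins, not the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic features: 'signal' determines the label, 'noise' does not.
rng = np.random.RandomState(42)
X = pd.DataFrame({
    'signal': rng.normal(size=400),
    'noise': rng.normal(size=400),
})
y = (X['signal'] > 0).astype(int)

X_train, X_val = X.iloc[:300], X.iloc[300:]
y_train, y_val = y.iloc[:300], y.iloc[300:]

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Shuffle each column n_repeats times and record the accuracy drop.
result = permutation_importance(
    model, X_val, y_val, scoring='accuracy', n_repeats=5, random_state=42
)
importances = pd.Series(
    result.importances_mean, index=X_val.columns
).sort_values(ascending=False)
print(importances)
```

Values near zero or below zero, like several in the commented results earlier in this notebook, suggest the feature adds little on the validation year.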

# DO XGBOOST IN COLAB

In [0]:
# !conda install -c mndrake xgboost

In [0]:
# from sklearn.metrics import r2_score
# from xgboost import XGBRegressor

# gb = make_pipeline(
#     ce.OrdinalEncoder(), 
#     XGBRegressor(n_estimators=200, objective='reg:squarederror', n_jobs=-1)
# )

# gb.fit(X_train, y_train_log)
# y_pred_log = gb.predict(X_val)
# y_pred = np.expm1(y_pred_log)
# print('Gradient Boosting R^2', r2_score(y_val, y_pred))

In [0]:
# pip install pdpbox

In [0]:
# from pdpbox.pdp import pdp_isolate, pdp_plot

# feature = 'Close'

# isolated = pdp_isolate(
#     model=gb, 
#     dataset=X_val, 
#     model_features=X_val.columns, 
#     feature=feature
# )

# pdp_plot(isolated, feature_name=feature);

In [0]:
# spy.iloc[[2]].to_string()

In [0]:
# spy.iloc[[2]]