Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep only the Tribeca neighborhood subset and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [111]:
# Imports


import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


In [137]:
# Wrangle Data

pd.set_option('display.max_rows', 150)

def readIn(file, parse_d='DATE', idx='DATE'):
    """ Opens .csv file, creates datetime index, and returns DataFrame"""

    DATA_PATH = '../data/build_finance/'
    df = pd.read_csv(DATA_PATH+file,
                     parse_dates=[parse_d]).set_index(idx)
    return df


def manyToOne(files_m, files_q, file_w):
    """Accepts lists of .csv files and returns single DataFrame"""

    # Read each monthly and quarterly file with readIn, which parses the
    # dates and sets the datetime index; collect the DataFrames in lists
    frames_m = [readIn(file) for file in files_m]
    frames_q = [readIn(file) for file in files_q]

    # Read in the S&P 500 data, keeping only the 'Date' and 'Close' columns.
    # 'Date' becomes the index, so only 'Close' is left to rename; pd.concat
    # aligns on index values, so the index name need not match 'DATE'
    DATA_PATH = '../data/build_finance/'
    sp = pd.read_csv(DATA_PATH+file_w, usecols=['Date', 'Close'],
                     parse_dates=['Date']).set_index('Date')
    sp.rename(columns={'Close': 'SP500_CLOSE'}, inplace=True)

    # Concatenate DataFrames held in frames_m and frames_q
    concat_m = pd.concat(frames_m, axis=1)
    concat_q = pd.concat(frames_q, axis=1)

    # Final concatenation of all DataFrames (monthly data, quarterly data 
    # and SP500 data)
    last = pd.concat([concat_m, concat_q, sp], axis=1)

    # SP500 data is only available from 1/1/1985
    # Mask out dates prior to January 1st, 1985
    mask = last.index >= '1985-01-01'
    df_final = last[mask]

    # Return DataFrame
    return df_final


def wrangle(files_m, files_q, file_w):
    # Pass .csv files and have a single DataFrame returned
    df = manyToOne(files_m, files_q, file_w)

    # Rename columns
    df.columns = ['cpi', '10yr_treasury', 'housing_starts', 
                  'industrial_prod', 'initial_claims', 'unemployment_rate', 
                  'corp_profits', 'exports_goods_svs', 'gdp', 'net_exports',
                  'sp500_close']

    # Reorganize columns
    cols_reorder = ['corp_profits', 'exports_goods_svs', 'net_exports', 
                    'gdp', '10yr_treasury', 'cpi', 'industrial_prod', 
                    'unemployment_rate', 'initial_claims', 'housing_starts', 
                    'sp500_close']

    df = df.reindex(columns=cols_reorder)

    # Create the target: does the S&P 500 close higher next month than this
    # month? True/False, so this is binary classification.
    # Note: the last row has no next month; NaN > 0 evaluates to False there.
    df['sp_ahead_pos_neg'] = (df['sp500_close'].shift(-1) - df['sp500_close']) > 0

    # Drop leaky feature
    df.drop(columns='sp500_close', inplace=True)

    # Forward fill the quarterly data
    df.ffill(inplace=True)

    # Create a new feature that shows the monthly change in initial
    # unemployment claims
    # df['change_initial_claims'] = df['initial_claims'] / df['initial_claims'].shift(+1)

    # # Drop 'initial_claims'
    # df.drop(columns='initial_claims', inplace=True)

    # Return wrangled DataFrame
    return df


file_w = '^GSPC_m.csv'

files_m = ['CPI.csv',
           'DGS10.csv',
           'HOUST.csv',
           'INDPRO.csv',
           'INITCLMS.csv',
           'UNRATE.csv']

files_q = ['CP.csv',
           'EXPGS.csv',
           'GDP.csv',
           'NETEXP.csv',]


#df = manyToOne(files_m, files_q, file_w)

df = wrangle(files_m, files_q, file_w)
print(df.shape)
df.head(10)

(427, 11)


| | corp_profits | exports_goods_svs | net_exports | gdp | 10yr_treasury | cpi | industrial_prod | unemployment_rate | initial_claims | housing_starts | sp_ahead_pos_neg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1985-01-01 | 204.768 | 306.01 | -91.298 | 7824.247 | 11.384286 | 105.7 | 56.1398 | 7.3 | 370750.0 | 1711.0 | True |
| 1985-02-01 | 204.768 | 306.01 | -91.298 | 7824.247 | 11.508889 | 106.3 | 56.3323 | 7.2 | 391750.0 | 1632.0 | False |
| 1985-03-01 | 204.768 | 306.01 | -91.298 | 7824.247 | 11.855238 | 106.8 | 56.4232 | 7.2 | 380800.0 | 1800.0 | False |
| 1985-04-01 | 205.85 | 304.126 | -114.445 | 7893.136 | 11.434762 | 107.0 | 56.2693 | 7.3 | 398250.0 | 1821.0 | True |
| 1985-05-01 | 205.85 | 304.126 | -114.445 | 7893.136 | 10.846818 | 107.2 | 56.3488 | 7.2 | 392750.0 | 1680.0 | True |
| 1985-06-01 | 205.85 | 304.126 | -114.445 | 7893.136 | 10.156 | 107.5 | 56.3901 | 7.4 | 392400.0 | 1676.0 | False |
| 1985-07-01 | 209.89 | 297.273 | -116.895 | 8013.674 | 10.306818 | 107.7 | 56.0234 | 7.4 | 378250.0 | 1684.0 | False |
| 1985-08-01 | 209.89 | 297.273 | -116.895 | 8013.674 | 10.331818 | 107.9 | 56.2555 | 7.1 | 404400.0 | 1743.0 | False |
| 1985-09-01 | 209.89 | 297.273 | -116.895 | 8013.674 | 10.373158 | 108.1 | 56.4983 | 7.1 | 408500.0 | 1676.0 | True |
| 1985-10-01 | 211.649 | 305.433 | -133.434 | 8073.239 | 10.236818 | 108.5 | 56.2648 | 7.1 | 405750.0 | 1834.0 | True |
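
The two wrangling steps that shape this table, forward-filling quarterly values onto a monthly index and building the month-ahead up/down target, can be sketched on toy data. The values and the names `close`, `gdp`, `up_next_month`, and `target_known` below are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data: six month-start dates with invented values
months = pd.date_range("2020-01-01", periods=6, freq="MS")
monthly = pd.DataFrame(
    {"close": [100.0, 104.0, 101.0, 101.0, 108.0, 107.0]}, index=months
)

# A quarterly series only has a value in the first month of each quarter
quarterly = pd.DataFrame({"gdp": [50.0, 52.0]}, index=months[[0, 3]])

# Column-wise concat aligns on index values; missing months become NaN
toy = pd.concat([monthly, quarterly], axis=1)

# Forward fill carries each quarterly value through the rest of its quarter
toy["gdp"] = toy["gdp"].ffill()

# Month-ahead direction: is next month's close above this month's close?
next_close = toy["close"].shift(-1)
toy["up_next_month"] = next_close > toy["close"]

# The final row has no next month, so its comparison against NaN is False.
# Flag it so it can be dropped instead of being read as a real "down" month.
toy["target_known"] = next_close.notna()

print(toy)
```

Filtering with `toy[toy["target_known"]]` would keep only the rows whose label is actually observed.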


In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 427 entries, 1985-01-01 to 2020-07-01
Freq: MS
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   corp_profits       427 non-null    float64
 1   exports_goods_svs  427 non-null    float64
 2   net_exports        427 non-null    float64
 3   gdp                427 non-null    float64
 4   10yr_treasury      427 non-null    float64
 5   cpi                427 non-null    float64
 6   industrial_prod    427 non-null    float64
 7   unemployment_rate  427 non-null    float64
 8   initial_claims     427 non-null    float64
 9   housing_starts     427 non-null    float64
 10  sp_ahead_pos_neg   427 non-null    bool   
dtypes: bool(1), float64(10)
memory usage: 53.3 KB


In [126]:
# Create a Feature Matrix and Target Vector
target = 'sp_ahead_pos_neg'

y = df[target]
X = df.drop(columns='sp_ahead_pos_neg')

print(X.shape, y.shape)

(427, 10) (427,)


In [127]:
# Split the data into train and test sets by date (test = 2014 onward)
# We will use 5-fold cross-validation on the training set

mask = df.index < '2014-01-01'

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(348, 10) (348,)
(79, 10) (79,)
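
The date-based split and the 5-fold cross-validation mentioned in the cell above could look roughly like this. It is a minimal sketch on random stand-in data; `X_demo`, `y_demo`, the feature names, and the 2018 cutoff are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 120 monthly rows, 3 random features, random labels
rng = np.random.default_rng(0)
idx = pd.date_range("2010-01-01", periods=120, freq="MS")
X_demo = pd.DataFrame(rng.normal(size=(120, 3)), index=idx, columns=["a", "b", "c"])
y_demo = pd.Series(rng.random(120) > 0.4, index=idx)

# Time-based split: rows before the cutoff train, the rest are held out
cutoff = pd.Timestamp("2018-01-01")
train_mask = X_demo.index < cutoff
X_tr, y_tr = X_demo.loc[train_mask], y_demo.loc[train_mask]
X_te, y_te = X_demo.loc[~train_mask], y_demo.loc[~train_mask]

# 5-fold cross-validation on the training rows only; test rows stay untouched
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")

print(X_tr.shape, X_te.shape)
print(cv_scores.round(3), round(cv_scores.mean(), 3))
```

With `cv=5` and a classifier, scikit-learn uses stratified folds without shuffling. For time-ordered data, `sklearn.model_selection.TimeSeriesSplit` is an alternative that keeps each validation fold later in time than its training fold.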


In [128]:
# Look at the distribution of the target, 'sp_ahead_pos_neg', in the training
# set to find the majority class: the market was up in the month ahead about
# 62.93% of the time.
# The classes are not severely imbalanced, so accuracy is a reasonable metric.
# We will also explore precision/recall and ROC curves (with AUC) for multiple models.

baseline_outcomes = y_train.value_counts(normalize=True)*100
print('The majority class is True - "The Market Went Up"')
print(f'Our baseline accuracy score is {baseline_outcomes.max():.2f}%')

The majority class is True - "The Market Went Up"
Our baseline accuracy score is 62.93%
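
As a cross-check on the majority-class baseline, the same number can be obtained from scikit-learn's `DummyClassifier`. This is a small sketch on hypothetical labels (`y_demo` is invented, with 7 "up" and 3 "down" months), not the notebook's `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 7 "up" months and 3 "down" months
y_demo = pd.Series([True, True, False, True, True, False, True, True, False, True])
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Baseline 1: share of the most frequent class
majority_share = y_demo.value_counts(normalize=True).max()

# Baseline 2: a model that always predicts the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
dummy_acc = accuracy_score(y_demo, dummy.predict(X_demo))

print(f"{majority_share:.2f} {dummy_acc:.2f}")
```

Using `.max()` on the normalized counts avoids depending on how a boolean label is looked up in the index.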


In [129]:
# Logistic Regression 
model_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

model_lr.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

In [131]:
# Training accuracy (scored on the same rows the model was fit on)
model_lr.score(X_train, y_train)

0.6494252873563219
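
The imports at the top include `roc_auc_score`, `roc_curve`, and `classification_report`. One way they might be applied to a fitted pipeline is sketched below on synthetic data; the arrays, the 200/100 split, and the variable names are assumptions for the example, and the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the label depends on the first feature plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300)) > 0

# Earlier rows train, later rows validate (no shuffling)
X_tr, X_val = X_demo[:200], X_demo[200:]
y_tr, y_val = y_demo[:200], y_demo[200:]

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# Predicted probability of the positive class (second column, the True class)
proba_up = pipe.predict_proba(X_val)[:, 1]

auc = roc_auc_score(y_val, proba_up)
fpr, tpr, thresholds = roc_curve(y_val, proba_up)
report = classification_report(y_val, pipe.predict(X_val))

print(f"validation ROC AUC: {auc:.3f}")
print(report)
```

The `fpr` and `tpr` arrays can be passed straight to `plt.plot(fpr, tpr)` to draw the ROC curve.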

In [132]:
# Random Forest Classifier 
model_rf = make_pipeline(
    RandomForestClassifier()
)

model_rf.fit(X_train, y_train)

Pipeline(steps=[('randomforestclassifier', RandomForestClassifier())])

In [133]:
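# Training accuracy; a score of 1.0 means the forest has memorized the
# training rows, so this is not an estimate of out-of-sample performance
model_rf.score(X_train, y_train)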

1.0
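
A perfect training score from a random forest is expected even when there is nothing to learn. The sketch below, on pure-noise data invented for the example, compares training accuracy with accuracy on rows held back from fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: pure noise, so the features carry no information
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 5))
y_demo = rng.random(400) > 0.5

X_tr, X_val = X_demo[:300], X_demo[300:]
y_tr, y_val = y_demo[:300], y_demo[300:]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)   # rows the trees have already seen
val_acc = forest.score(X_val, y_val)   # rows held back from fitting

print(f"train accuracy: {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

The gap between the two numbers, rather than the training score alone, is what indicates overfitting.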

In [None]:
# Let's look at some interesting hyperparameters and tune the model
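
One possible shape for the tuning step is a small cross-validated grid search. The notebook does not say which hyperparameters it will tune, so the grid below (`max_depth`, `min_samples_leaf`), the synthetic data, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical data with a weak signal in the first feature
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + rng.normal(size=200)) > 0

# Illustrative grid: these values are assumptions, not tuned recommendations
param_grid = {
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

Limiting tree depth or raising the minimum leaf size are common ways to reduce the kind of memorization that a training score of 1.0 suggests.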
