# Data Prep Notebook

This notebook was used to clean and format datasets. The flags in the cells below control which actions are executed when the notebook is run with "Run All".

### Clean Data

Extra tabs, linefeeds, stop words (NLTK), and punctuation will be removed. Contractions will be expanded to words. 

- `cleanDataSet_flag = True` -> This will run the `cleanUp()` function on the specified .csv dataset.
- Set `cleanFilePath` to the location of the dataset to clean.
- Set `columnsToClean` to a list of the column indices to clean. Some columns should not be cleaned, such as Date or other non-text columns.

The resulting dataset will have the same file name with `_clean` added to the end of the file name. 
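
The cleaning steps listed above can be sketched in a few lines. This is only an illustration, not the `CleanData` implementation (which is not shown in this notebook): `clean_text`, `clean_output_path`, the tiny `STOP_WORDS` set, and the `CONTRACTIONS` map are hypothetical stand-ins, and lowercasing the text is an added assumption so that the lookups match.

```python
import os
import re
import string

# Hypothetical stand-ins: the notebook relies on NLTK stop words and its own
# contraction handling inside CleanData, neither of which is reproduced here.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text):
    """Illustrative cleaning pass; not the CleanData implementation."""
    # Collapse tabs, linefeeds, and repeated spaces into single spaces.
    # Lowercasing is an assumption made so the dictionary lookups match.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Expand contractions before punctuation removal strips the apostrophes.
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split(" "))
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

def clean_output_path(path):
    """Append '_clean' to the file name, keeping the extension."""
    root, ext = os.path.splitext(path)
    return root + "_clean" + ext

print(clean_text("It's\ta test:\n the stock can't fall!"))
print(clean_output_path("../Data/us_equities_news_dataset.csv"))
```

Expanding contractions before removing punctuation matters in this sketch: once the apostrophe is stripped, `can't` becomes `cant` and no longer matches the map.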

In [None]:
cleanDataSet_flag = False
cleanFilePath = '../Data/us_equities_news_dataset.csv'
columnsToClean = [2,4] # range(2,26)   #[2,4]

### Create Single Stock Dataset

Combines news articles for an individual stock with a label for that day's stock performance. If the stock's closing price is greater than or equal to its opening price, a label of 1 is assigned to represent a gain. Otherwise, a label of 0 is assigned to represent a loss.

- `createSingleStockDataset_flag = True` -> This will create a dataset for the selected stock. 
- Set `newsFilePath` to the location of the news dataset. Note: this file is not stored on Github due to size limits. 
- Set `tickerSymbol` to one of the stocks listed in the dictionary below. 
- Set `textChoice` to either 'title' or 'content'. This will use just the title of the news article or the content of the news article. 

The resulting dataset will be named `XXX_TEXT_NewsDataset.csv`, where XXX is the stock's ticker symbol and TEXT is the value of `textChoice` (`title` or `content`).
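
The labeling rule and file naming described above can be illustrated without pandas. The sketch below is a simplified stand-in for the notebook's code: `label_day` and `output_name` are hypothetical helper names, and the prices and headlines are made up. Articles whose date has no matching price row are dropped, which mirrors the default inner join of `DataFrame.merge`.

```python
def label_day(open_price, close_price):
    """Return 1 for a gain (close >= open), otherwise 0 for a loss."""
    return 1 if close_price >= open_price else 0

def output_name(ticker_symbol, text_choice):
    """Build the output file name from the ticker and the text choice."""
    return ticker_symbol + "_" + text_choice + "_NewsDataset.csv"

# Made-up (open, close) prices keyed by date, for illustration only.
prices = {
    "2020-01-02": (100.0, 101.5),
    "2020-01-03": (101.5, 99.0),
    "2020-01-06": (99.0, 99.0),
}

# Made-up (date, text) news rows; the last one has no trading day.
news = [
    ("2020-01-02", "example headline one"),
    ("2020-01-03", "example headline two"),
    ("2020-01-04", "weekend headline with no trading day"),
]

# Keep only articles whose date appears in the price data (inner join).
labeled = [
    (date, text, label_day(*prices[date]))
    for date, text in news
    if date in prices
]

print(labeled)
print(output_name("AAPL", "title"))
```

Note that a day where the close equals the open counts as a gain under this rule.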

In [None]:
# Change these
createSingleStockDataset_flag = False
newsFilePath = '../Data/us_equities_news_dataset_clean.csv'
tickerSymbol = 'AAPL'   # pick a stock ticker here
textChoice = 'title' # 'content'    Choose 'title' of article or 'content' of the article here

# Init - leave these alone
tickerMap = {
    'AAPL' : {'pricePath' : '../Data/HistoricalData_AAPL.csv',  },
    'MSFT' : {'pricePath' : '../Data/HistoricalData_MSFT.csv',  },
    'AMZN' : {'pricePath' : '../Data/HistoricalData_AMZN.csv',  },
    'TSLA' : {'pricePath' : '../Data/HistoricalData_TSLA.csv',  },
    'AMD'  : {'pricePath' : '../Data/HistoricalData_AMD.csv',   },
    'NFLX' : {'pricePath' : '../Data/HistoricalData_NFLX.csv',  },
    'SBUX' : {'pricePath' : '../Data/HistoricalData_SBUX.csv',  },
    'GOOGL': {'pricePath' : '../Data/HistoricalData_GOOGL.csv', },
    'BA'   : {'pricePath' : '../Data/HistoricalData_BA.csv',    },
}

priceFilePath = tickerMap[tickerSymbol]['pricePath']   # use the price file for the selected ticker

### Combine Reddit News Headlines

The data was previously cleaned using the other functions. This step just combines the headlines into one entry per data point.

Setting the flag to True will re-run the combine step. The combined output is already in the Data folder.
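
As a rough illustration of the combine step, the sketch below joins several headline cells into one text entry per row. The row layout (two leading columns followed by headline columns) and the sample values are assumptions made for the example, and `combine_headlines` is a hypothetical helper, not part of the notebook.

```python
def combine_headlines(row, first_headline_index=2):
    """Join every cell from first_headline_index onward into one string."""
    return " ".join(str(cell) for cell in row[first_headline_index:])

# Made-up rows: a date, a label, then the headline cells.
rows = [
    ["2016-07-01", 1, "markets rally", "oil prices climb"],
    ["2016-06-30", 0, "vote results shock", "currency slides"],
]

# One entry per data point: the combined text plus its label.
combined = [{"Text": combine_headlines(row), "Label": row[1]} for row in rows]

for entry in combined:
    print(entry)
```

Casting each cell with `str` before joining plays the same role as the notebook's `astype` call: `' '.join` only accepts strings.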

In [None]:
CombineRedditDataset_flag = False
redditFile = '../Data/Reddit_News_DJIA_clean2_clean.csv'

### Clean Data Code

In [None]:
if cleanDataSet_flag:
    import os
    parentPath = os.path.abspath('.')
    from CleanData import CleanData

    clean = CleanData(cleanFilePath, columnsToClean)
    clean.cleanUp()

### Single Stock Data Code

In [None]:
if createSingleStockDataset_flag:
    import pandas as pd

    def dateConvert(dateString):
        # Reorder a slash-separated date (e.g. MM/DD/YYYY) into YYYY-MM-DD
        parts = dateString.split('/')
        return parts[2]+'-'+parts[0]+'-'+parts[1]

In [None]:
if createSingleStockDataset_flag:
    # Keep only the news rows for the selected ticker
    dataFrame = pd.read_csv(newsFilePath)
    stockDF = pd.DataFrame(dataFrame.loc[dataFrame['ticker'] == tickerSymbol])
    stockDF.rename(columns={'release_date':'Date',textChoice:'Text'}, inplace=True)
    # Load prices and label each day: 1 if Close/Last >= Open, else 0
    priceDF = pd.read_csv(priceFilePath, converters={'Date': dateConvert})
    priceDF['Label'] = (priceDF['Open'] <= priceDF['Close/Last']) * 1
    # Attach each day's label to the articles released on that date (inner join)
    combinedDF = pd.DataFrame()
    combinedDF['Date'] = priceDF['Date']
    combinedDF['Label'] = priceDF['Label']
    combinedDF = combinedDF.merge(stockDF[['Text','Date']], on='Date')
    combinedDF.to_csv('../Data/' + tickerSymbol + '_' + textChoice + '_' + 'NewsDataset.csv')
    print('Total data points =', len(combinedDF),'\n\n')
    print(combinedDF.head())

### Reddit Data Combine Code

In [None]:
if CombineRedditDataset_flag:
    import pandas as pd
    dataFrame = pd.read_csv(redditFile)
    # Cast every column except Label to str so the headline cells can be joined
    columnsDict = dict(zip(dataFrame.columns, [str]*len(dataFrame.columns)))
    columnsDict.pop('Label')
    dataFrame = dataFrame.astype(columnsDict)
    # Join the columns from index 2 onward into one Text entry per row
    combinedDF = pd.DataFrame()
    combinedDF['Text'] = dataFrame.iloc[:,2:].apply(' '.join, axis=1)
    combinedDF['Label'] = dataFrame['Label']
    combinedDF.to_csv('../Data/' + 'Reddit' + '_' + 'title' + '_' + 'NewsDataset.csv')
    print('Total data points =', len(combinedDF),'\n\n')
    print(combinedDF.head())