<h1>
<center>
Dataquest Guided Project 23:
Analyzing Stock Prices
</center>
</h1>

## Introduction

This is part of the Dataquest program.

- part of path **Data Engineer**
    - Step 2: **Handling Large Data Sets in Python**
        - Course 3 :  **Algorithms and Data Structures **
            - Processing Tasks with Stacks and Queues
            - Effectively using arrays and lists
            - Sorting and Searching arrays and lists
            - Hash tables
As this is a guided project, we are following and deepening the steps suggested by Dataquest. In this project, we will practise working with large datasets in pandas.

## Use case : Analyzing Stock Prices

In this guided project, we'll work with stock market data that was downloaded from [Yahoo Finance](https://finance.yahoo.com/) using the [yahoo_finance](https://pypi.python.org/pypi/yahoo-finance) Python package. This data consists of the daily stock prices from 2007-1-1 to 2017-04-17 for several hundred stock symbols traded on the [NASDAQ](http://www.nasdaq.com/) stock exchange. We downloaded all of the stock price data in a folder called prices. Each file in the prices folder is named for a specific stock symbol, and contains the : 

|Header| Description|
|------|------------|
|date| date that the data is from |
|close| the closing price on that day, which is the price when the trading day ends|
|open| the opening price on that day, which is the price when the trading day starts|
|high| the highest price the stock reached during trading|
|low| the lowest price the stock reached during trading|
|volume| the number of shares that were traded during the day|

The prices are stored in ascending order by day. Stock trading doesn't happen on certain days, like weekends and holidays, so there are gaps between days as we only have data for days on which trading happened.

To read in and store all of the data, we'll need several layers of indices: 
- Layer 1 : the stock symbol, or a numeric index representing the stock symbol
- Layer 2: the rows in a stock symbol csv file
- Layer 3: the column names in a stock symbol csv file. 

The layer 1 data structure is a hash table, the layer 2 data structure is a list, and the layer 3 data structure is a list. 

## Stock Price Data

Let's read in the data using the suggested structure. We'll use multiple processes to read in the data.

In [1]:
import concurrent.futures
import os

def read_file(filename):
    with open(filename, 'r') as f:
        data = f.read().strip()
    key = filename.replace(".csv", "").replace("prices/", "")
    data = data.split("\n")
    data = [d.split(",") for d in data]
    return key, data

results = []
pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
filenames = ["prices/{}".format(f) for f in os.listdir("prices")]
prices = pool.map(read_file, filenames)
prices = list(prices)
prices = dict(prices)

## Computing Aggregates

In [2]:
from dateutil.parser import parse

prices_columns = {}

for k,v in prices.items():
    price = v
    headers = price[0]
    price_columns = {}
    for i, header in enumerate(headers):
        values = [p[i] for p in price[1:]]
        if i > 0:
            values = [float(v) for v in values]
        else:
            values = [parse(v) for v in values]
        price_columns[header] = values
    prices_columns[k] = price_columns

Let's find out which stocks have the highest and lowest average closing prices. 

In [3]:
from statistics import mean

average_closing = {}
for k,v in prices_columns.items():
    average_closing[k] = mean(v["close"])

In [5]:
closing_tuples = [(k,v) for k,v in average_closing.items()]
sorted_closing_tuples = sorted(closing_tuples, key=lambda x:x[1])

In [17]:
lowest_average = [sorted_closing_tuples[i][0] for i in range(3)]
highest_average = [sorted_closing_tuples[-i][0] for i in range(1,4)]
print("It appears that {} have the 3 lowest average closing price \n whereas {} have the 3 highest closing price".format(lowest_average, highest_average))

It appears that ['blfs', 'apdn', 'bmra'] have the 3 lowest average closing price 
 whereas ['amzn', 'aapl', 'cme'] have the 3 highest closing price


## Finding the most Traded Stock Each Day

Let's now work on finding the most traded stock each day. In order to do this, we'll combine the volume for each stock on each day, and the stock symbol, then sort the volume in descending order. 

In [18]:
trades = {}
for k, v in prices_columns.items():
    for i,date in enumerate(v["date"]):
        if date not in trades:
            trades[date] = []
        trades[date].append([k,v["volume"][i]])

In [19]:
most_traded = []
for k, v in trades.items():
    ordered = sorted(v, key=lambda x: x[1])
    symbol = ordered[-1][0]
    most_traded.append([k, symbol])
most_traded = sorted(most_traded, key=lambda x: x[0])

most_traded

[[datetime.datetime(2007, 1, 3, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 4, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 5, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 8, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 9, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 10, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 11, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 12, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 16, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 17, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 18, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 19, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 22, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 23, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 24, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 25, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 26, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 29, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 30, 0, 0), 'aapl'],
 [datetime.datetime(2007, 1, 31, 0, 0), 'aapl'],
 [datetime.datetime(2007,

It looks that aapl is the most traded stock for most of the days.

## Searching for High Volumes Days

We now want to search for transactions in a list on a specific date. To do so, we will first select the 10 highest volume days, then use a binary search.

In [20]:
daily_volumes = {}

most_traded = []
for k, v in trades.items():
    volume = sum([item[1] for item in v])
    daily_volumes[k] = volume

In [21]:
volume_tuples = [[k,v] for k,v in daily_volumes.items()]
volume_tuples = sorted(volume_tuples, key=lambda x: x[1])

volume_tuples[-10:]

[[datetime.datetime(2008, 1, 24, 0, 0), 1533363200.0],
 [datetime.datetime(2008, 1, 16, 0, 0), 1536176400.0],
 [datetime.datetime(2007, 11, 8, 0, 0), 1553880500.0],
 [datetime.datetime(2008, 9, 29, 0, 0), 1555072400.0],
 [datetime.datetime(2008, 2, 7, 0, 0), 1559032100.0],
 [datetime.datetime(2008, 1, 22, 0, 0), 1578877700.0],
 [datetime.datetime(2008, 10, 8, 0, 0), 1599183500.0],
 [datetime.datetime(2007, 7, 26, 0, 0), 1611272800.0],
 [datetime.datetime(2008, 10, 10, 0, 0), 1770266900.0],
 [datetime.datetime(2008, 1, 23, 0, 0), 1964583900.0]]

In [22]:
import math

high_volume_days = [v[0] for v in volume_tuples[-10:]]

def binary_search(array, search):
    m = 0
    i = 0
    z = len(array) - 1
    while i<= z:
        m = math.floor(i + ((z - i) / 2))
        if array[m] == search:
            return m
        elif array[m] < search:
            i = m + 1
        elif array[m] > search:
            z = m - 1

high_volume_transactions = {}
for k,v in prices_columns.items():
    for day in high_volume_days:
        ind = binary_search(v["date"], day)
        if ind is None:
            continue
        if k not in high_volume_transactions:
            high_volume_transactions[k] = []
        high_volume_transactions[k].append(prices[k][ind])

In [23]:
high_volume_transactions

{'aal': [['2008-01-23', '13.14', '12.04', '13.42', '11.75', '4990600'],
  ['2008-01-15', '12.51', '11.85', '12.64', '11.75', '6321800'],
  ['2007-11-07', '22.60', '22.610001', '23.25', '22.00', '4501800'],
  ['2008-09-26', '6.12', '6.01', '6.29', '5.90', '4478800'],
  ['2008-02-06', '15.34', '14.76', '15.65', '14.06', '5329200'],
  ['2008-01-18', '12.92', '12.35', '13.14', '12.35', '3806100'],
  ['2008-10-07', '5.11', '6.18', '6.30', '4.95', '10827400'],
  ['2007-07-25', '34.84', '35.259998', '35.650002', '34.240002', '1992600'],
  ['2008-10-09', '3.63', '4.40', '4.74', '3.56', '8180200'],
  ['2008-01-22', '12.02', '12.26', '12.92', '11.61', '4828200']],
 'aame': [['2008-01-23', '1.48', '1.35', '1.50', '1.30', '6100'],
  ['2008-01-15', '1.50', '1.50', '1.50', '1.50', '400'],
  ['2007-11-07', '2.30', '2.25', '2.30', '2.13', '2500'],
  ['2008-09-26', '1.20', '1.22', '1.27', '1.20', '900'],
  ['2008-02-06', '1.65', '1.70', '1.70', '1.65', '2700'],
  ['2008-01-18', '1.49', '1.43', '1.50', 

## Finding Profitable Stocks

Let's see which stocks would have been the most profitable to buy on 2007-01-03

In [25]:
profits = []
for k,v in prices_columns.items():
    percentage = (v["close"][-1] - v["close"][0]) / v["close"][0]
    profits.append([k,percentage * 100])

profits = sorted(profits, key=lambda x: x[1])

profits[-10:]

[['achc', 1330.0000666666667],
 ['bcli', 1339.2137535980346],
 ['cui', 1525.1625162516252],
 ['apdn', 1549.6700659868025],
 ['anip', 1707.3554472785033],
 ['amzn', 2230.7234281466817],
 ['blfs', 2437.4365640858978],
 ['arcw', 3898.60048982856],
 ['adxs', 4005.0000000000005],
 ['admp', 7483.8389225948395]]

The most profitable stock to buy in 2007 would have been ADMP, which appreciated from around 7 cents to its current price of 4.43.

## Next Steps

We've done some basic analysis of the data, but there's still quite a bit more depth to go into:

- What stocks would have been best to short at the start of the period?
- Which stocks have the most after-hours trading, and show the biggest changes between the closing price and the next day open?
- Can technical indicators like Bollinger Bands help us forecast the market?
- What time periods have resulted in steady increases in prices, and what periods have resulted in steady declines?
- Based on price, what was the optimal day to buy each stock if we wanted to hold them until now?
- On days with high trading volume, do stocks move in one direction (up or down) more than the other one?