# Analyzing Stock Prices

## Introduction

In this guided project, we'll work with stock market data that was downloaded from [Yahoo Finance](https://finance.yahoo.com/) using the [yahoo_finance](https://pypi.python.org/pypi/yahoo-finance) Python package. The data consists of the daily stock prices from _2007-01-01_ to _2017-04-17_ for several hundred stock symbols traded on the [NASDAQ](http://www.nasdaq.com/) stock exchange, stored in the prices folder. The **download_data.py** script, located in the same folder as the Jupyter notebook, was used to download all of the stock price data. Each file in the prices folder is named for a specific stock symbol and contains the following columns:
- **date** -- date that the data is from.
- **close** -- the closing price on that day, which is the price when the trading day ends.
- **open** -- the opening price on that day, which is the price when the trading day starts.
- **high** -- the highest price the stock reached during trading.
- **low** -- the lowest price the stock reached during trading.
- **volume** -- the number of shares that were traded during the day.

The prices are sorted in ascending order by day. Stock trading doesn't happen on certain days, like weekends and holidays, so there are gaps between days -- we only have data for days on which trading happened.

To read in and store all of the data, we'll need several layers of indices:
- **Layer 1** -- the stock symbol, or a numeric index representing the stock symbol.
- **Layer 2** -- the column names in a stock symbol CSV file.
- **Layer 3** -- the rows in a stock symbol CSV file.

A good choice to structure the data could be:
- **Hash table** for Layer 1 (stock symbol)
- **Hash table** for Layer 2 (columns)
- **List** for Layer 3 (rows)
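
As a rough illustration of this layout, the sketch below builds the nested structure by hand. The symbol `xyz` and all of its values are invented for the example and do not come from the dataset.

```python
import datetime as dt

# Hypothetical two-day sample for an invented symbol "xyz"; the values are
# made up purely to illustrate the layout.
prices_sketch = {
    "xyz": {                                                  # Layer 1: stock symbol
        "date": [dt.date(2007, 1, 3), dt.date(2007, 1, 4)],   # Layer 2: column name
        "close": [10.5, 11.0],                                # Layer 3: one entry per row
        "volume": [1000, 1200],
    }
}

# Look up the closing price of "xyz" on its second trading day.
second_close = prices_sketch["xyz"]["close"][1]
print(second_close)  # 11.0
```

With this layout, a whole column for one stock is a single list, which keeps per-column aggregates simple.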

## Importing packages

In [1]:
import os
import concurrent.futures
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from statistics import mean

%matplotlib inline

## Reading files

We will read all the files and save the information as:
1. Dictionaries of stocks (stock name is the key parameter).
2. Each stock is a dictionary, where keys are the different column names in the dataset.
3. Inside column dictionaries there is the list of values (chronologically sorted, as in the dataset).

In [2]:
def read_file(filepath):
    with open(filepath, 'r') as f:
        data = f.read().strip()
    # Use the file name without folder or extension as the stock symbol
    key = os.path.splitext(os.path.basename(filepath))[0]
    data = data.split("\n")
    data = [d.split(",") for d in data][1:]
    data_dict = get_data_dict(data)
    return key, data_dict

def get_data_dict(data):
    columns = ['date', 'close', 'open', 'high', 'low', 'volume']
    types = ["date","float","float","float","float","int"]
    data_dict = {}
    for i, name in enumerate(columns):
        if types[i] == "date":
            data_dict[name] = [dt.date.fromisoformat(d[i]) for d in data]
        elif types[i] == "float":
            data_dict[name] = [float(d[i]) for d in data]
        elif types[i] == "int":
            data_dict[name] = [int(d[i]) for d in data]
    return data_dict

In [3]:
# Multiprocessing was not available on Windows 10, so this parallel version is left commented out
# results = []
# pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
# filepaths = ["my_datasets/prices/{}".format(f) for f in os.listdir("my_datasets/prices")]
# prices = pool.map(read_file, filepaths)
# prices = list(prices)
# prices = dict(prices)

In [4]:
filepaths = ["my_datasets/prices/{}".format(f) for f in os.listdir("my_datasets/prices")]
prices = dict([read_file(f) for f in filepaths])
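
If some concurrency is still wanted, one possible alternative is a thread pool, which avoids starting separate processes. The sketch below is only an assumption about how that could look: `read_lines` is a hypothetical stand-in for `read_file`, and the two CSV files are invented and written to a temporary folder so the example runs on its own. Whether threads actually speed up loading depends on how much of the time is spent on file I/O rather than parsing.

```python
import concurrent.futures
import os
import tempfile

def read_lines(filepath):
    # Hypothetical stand-in for read_file: returns (symbol, number of data rows).
    with open(filepath, "r") as f:
        rows = f.read().strip().split("\n")[1:]  # skip the header line
    symbol = os.path.splitext(os.path.basename(filepath))[0]
    return symbol, len(rows)

with tempfile.TemporaryDirectory() as folder:
    # Two invented files so the sketch runs without the real prices folder.
    for symbol in ("aaa", "bbb"):
        with open(os.path.join(folder, symbol + ".csv"), "w") as f:
            f.write("date,close\n2007-01-03,1.0\n2007-01-04,2.0\n")
    filepaths = [os.path.join(folder, name) for name in os.listdir(folder)]
    # Threads share the notebook's memory, so no pickling of functions is needed.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        row_counts = dict(pool.map(read_lines, filepaths))

print(row_counts)
```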

## Computing aggregates

- Computing the **average of closing values** for each stock:

In [5]:
avg_closing = {}
for key, val in prices.items():
    avg_closing[key] = mean(val["close"])

Dictionaries have no sort method of their own, so we can convert the dictionary to a list of `(key, value)` tuples and sort that list by value.

In [6]:
avg_closing_tuples = sorted(avg_closing.items(), key=lambda x: x[1], reverse=True)
# Top 5 highest average closing values
avg_closing_tuples[:5]

[('amzn', 275.1340775710425),
 ('aapl', 257.1765404023166),
 ('cme', 230.29466011003862),
 ('atri', 228.38977615984555),
 ('fcnca', 200.2524827814672)]
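
When only the top few entries are needed, `heapq.nlargest` from the standard library is a possible alternative to sorting the whole list. The sketch below uses invented averages, not values from the dataset.

```python
import heapq

# Invented averages, not taken from the dataset.
avg_sketch = {"aaa": 12.5, "bbb": 40.0, "ccc": 7.25, "ddd": 33.1}

# Top 2 (symbol, average) pairs by value, highest first.
top_2 = heapq.nlargest(2, avg_sketch.items(), key=lambda item: item[1])
print(top_2)  # [('bbb', 40.0), ('ddd', 33.1)]
```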

- Computing the **average volumes** for each stock:

In [7]:
avg_volume = {}
for key, val in prices.items():
    avg_volume[key] = mean(val["volume"])
    
# We can create a list of tuples again
avg_volume_tuples = sorted(avg_volume.items(), key=lambda x: x[1], reverse=True)

# Top 5 highest average volumes
avg_volume_tuples[:5]

[('aapl', 130112422.35521236),
 ('csco', 45224781.428571425),
 ('cmcsa', 34337459.69111969),
 ('ebay', 29059822.548262548),
 ('amd', 24757016.94980695)]

- Computing the **average absolute difference between the opening price and the closing price** for each stock:

In [8]:
avg_daily_diff = {}
for key, val in prices.items():
    avg_daily_diff[key] = mean([abs(o-c) for o,c in zip(val["open"], val["close"])])
    
# We can create a list of tuples again
avg_daily_diff_tuples = sorted(avg_daily_diff.items(), key=lambda x: x[1], reverse=True)

# Top 5 highest average daily variations
avg_daily_diff_tuples[:5]

[('cme', 3.6001273864864864),
 ('bidu', 3.5598418084942085),
 ('amzn', 3.1338533413127414),
 ('aapl', 2.860417226640927),
 ('atri', 2.718297284942085)]
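
The calculation above takes the absolute value of each daily difference, so it measures the size of the daily moves rather than their direction. A small sketch with invented prices shows why that matters: signed differences can cancel each other out.

```python
from statistics import mean

# Invented prices: the stock rises by 2 on day one and falls by 2 on day two.
opens = [10.0, 12.0]
closes = [12.0, 10.0]

# Signed differences cancel out; absolute differences keep the size of each move.
signed_avg = mean([c - o for o, c in zip(opens, closes)])
absolute_avg = mean([abs(o - c) for o, c in zip(opens, closes)])

print(signed_avg, absolute_avg)  # 0.0 2.0
```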

## Finding The Most Traded Stock Each Day

Now that we've computed some aggregates, we can work on finding the most traded stock each day. We'll need to create a data structure that stores the dates and the stock symbols that were the most traded on that day. 

To find this, we'll pair each stock's volume with its symbol for every day, then sort the pairs for each day by volume in descending order.

We can create a structure made of:
- A dictionary with the date as key
- A list of `(company, volume)` tuples as the value for each date

In [9]:
# Create structure in a dictionary with date as key
trades = {}

for key, val in prices.items():
    for i, date in enumerate(val["date"]):
        if date not in trades:
            trades[date]=[]
        trades[date].append((key,val["volume"][i]))
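
The same grouping can also be written with `collections.defaultdict`, which removes the explicit check for a missing key. This is only a sketch on an invented two-stock sample (`prices_sketch` and `trades_sketch` are hypothetical names), laid out like `prices`.

```python
from collections import defaultdict
import datetime as dt

# Invented two-stock sample in the same layout as `prices`.
prices_sketch = {
    "aaa": {"date": [dt.date(2007, 1, 3), dt.date(2007, 1, 4)], "volume": [100, 200]},
    "bbb": {"date": [dt.date(2007, 1, 4)], "volume": [300]},
}

# Missing dates start out as an empty list automatically.
trades_sketch = defaultdict(list)
for symbol, columns in prices_sketch.items():
    # zip pairs each date with the volume in the same row.
    for date, volume in zip(columns["date"], columns["volume"]):
        trades_sketch[date].append((symbol, volume))

print(trades_sketch[dt.date(2007, 1, 4)])  # [('aaa', 200), ('bbb', 300)]
```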

In [10]:
# Take the stock with highest volume for each date
most_traded = []

for key, val in trades.items():
    sorted_trades = sorted(val, key= lambda x: x[1], reverse=True)
    most_traded.append((key,sorted_trades[0][0]))
    
most_traded = sorted(most_traded, key= lambda x: x[0])

In [11]:
most_traded[:5]

[(datetime.date(2007, 1, 3), 'aapl'),
 (datetime.date(2007, 1, 4), 'aapl'),
 (datetime.date(2007, 1, 5), 'aapl'),
 (datetime.date(2007, 1, 8), 'aapl'),
 (datetime.date(2007, 1, 9), 'aapl')]
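
Since only the single highest-volume stock is needed for each day, `max` with a `key` function is a possible alternative to sorting every day's list. The sketch below uses invented `(symbol, volume)` pairs for one hypothetical day.

```python
# Invented (symbol, volume) pairs for one hypothetical trading day.
day_trades = [("aaa", 500), ("bbb", 1500), ("ccc", 900)]

# max() scans the list once instead of sorting all of it.
top_symbol, top_volume = max(day_trades, key=lambda x: x[1])
print(top_symbol, top_volume)  # bbb 1500
```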

## Searching For High Volume Days

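Let's say we want to look up the row for a specific date in a stock's list of dates. We can use either a linear or a binary search for this. Because the dates are sorted in ascending order, binary search applies, and it needs only O(log n) comparisons per lookup instead of O(n), which pays off when we repeat the search many times.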

Let's search for all transactions on days with unusually high volume. In order to do this, we'll need to:
- Compute total volume of trading for each day
- Sort and find the 10 highest volume days overall
- Find all prices for all stocks on each of the high volume days

In [15]:
total_volume_per_days = []

for key, val in trades.items():
    total_volume = sum([x[1] for x in val])
    total_volume_per_days.append((key, total_volume))

top_10_days_w_volume = sorted(total_volume_per_days, key= lambda x: x[1], reverse=True)[:10]
top_10_days_w_volume

[(datetime.date(2008, 1, 23), 1964583900),
 (datetime.date(2008, 10, 10), 1770266900),
 (datetime.date(2007, 7, 26), 1611272800),
 (datetime.date(2008, 10, 8), 1599183500),
 (datetime.date(2008, 1, 22), 1578877700),
 (datetime.date(2008, 2, 7), 1559032100),
 (datetime.date(2008, 9, 29), 1555072400),
 (datetime.date(2007, 11, 8), 1553880500),
 (datetime.date(2008, 1, 16), 1536176400),
 (datetime.date(2008, 1, 24), 1533363200)]

In [16]:
top_10_days = [x[0] for x in top_10_days_w_volume]
top_10_days

[datetime.date(2008, 1, 23),
 datetime.date(2008, 10, 10),
 datetime.date(2007, 7, 26),
 datetime.date(2008, 10, 8),
 datetime.date(2008, 1, 22),
 datetime.date(2008, 2, 7),
 datetime.date(2008, 9, 29),
 datetime.date(2007, 11, 8),
 datetime.date(2008, 1, 16),
 datetime.date(2008, 1, 24)]

In [17]:
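def binary_search(array, value):
    # Return the index of value in the sorted list array, or None if it is absent
    i = 0
    z = len(array) - 1
    while i <= z:
        m = i + (z - i) // 2  # integer midpoint, no float conversion needed
        if array[m] == value:
            return m
        elif array[m] < value:
            i = m + 1
        else:
            z = m - 1
    return None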
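
The standard library's `bisect` module offers the same kind of lookup on a sorted list. The sketch below is one way to wrap it so that it returns an index or `None`, like the function above; `bisect_search` is a hypothetical helper name and the dates are a small invented sample.

```python
import bisect
import datetime as dt

def bisect_search(array, value):
    # Index of value in the sorted list array, or None if it is absent.
    pos = bisect.bisect_left(array, value)
    if pos < len(array) and array[pos] == value:
        return pos
    return None

# Small invented sample of sorted trading dates (the weekend is skipped).
dates = [dt.date(2007, 1, 3), dt.date(2007, 1, 4), dt.date(2007, 1, 8)]
found = bisect_search(dates, dt.date(2007, 1, 4))    # 1
missing = bisect_search(dates, dt.date(2007, 1, 6))  # None
```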

In [18]:
def get_day_row(key, index):
    row = []
    for k,v in prices[key].items():
        row.append((k, v[index]))
    return row

In [20]:
high_volume_days = {}

for key, val in prices.items():
    for day in top_10_days:
        ind = binary_search(val["date"], day)
        if ind is None:
            continue
        if key not in high_volume_days:
            high_volume_days[key] = []
        high_volume_days[key].append(get_day_row(key,ind))
        
high_volume_days["aapl"]

[[('date', datetime.date(2008, 1, 23)),
  ('close', 139.070005),
  ('open', 136.190006),
  ('high', 140.0),
  ('low', 126.140003),
  ('volume', 843242400)],
 [('date', datetime.date(2008, 10, 10)),
  ('close', 96.799999),
  ('open', 85.699999),
  ('high', 99.999999),
  ('low', 85.000003),
  ('volume', 554824900)],
 [('date', datetime.date(2007, 7, 26)),
  ('close', 146.000004),
  ('open', 145.910002),
  ('high', 148.499994),
  ('low', 136.959997),
  ('volume', 546657300)],
 [('date', datetime.date(2008, 10, 8)),
  ('close', 89.789999),
  ('open', 85.909997),
  ('high', 96.330002),
  ('low', 85.679998),
  ('volume', 551935300)],
 [('date', datetime.date(2008, 1, 22)),
  ('close', 155.639997),
  ('open', 148.059998),
  ('high', 159.980003),
  ('low', 146.000004),
  ('volume', 608688500)],
 [('date', datetime.date(2008, 2, 7)),
  ('close', 121.239998),
  ('open', 119.969995),
  ('high', 124.779999),
  ('low', 117.27),
  ('volume', 520832900)],
 [('date', datetime.date(2008, 9, 29)),
  ('c

## Finding Profitable Stocks

Now that we've done some basic analysis, let's see which stocks would have been the most profitable to buy on 2007-01-03. We can do this by:
- Subtracting the initial price (first day in dataset) from the final price (last day in dataset), then computing a percentage relative to the initial price. This will tell us how much our initial investment would have grown or shrunk.
- Sorting all of the percentages.
- Finding the stock that grew the most in the time period.
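
The percentage calculation in the first step can be sketched on its own. `percent_change` is a hypothetical helper name, and the prices are invented to keep the arithmetic easy to check.

```python
def percent_change(initial, final):
    # Growth of an investment relative to its initial price, in percent.
    return (final - initial) / initial * 100

grew = percent_change(20.0, 50.0)    # 150.0: the investment grew by 150%
shrank = percent_change(20.0, 10.0)  # -50.0: the investment lost half its value
```

A negative result means the stock ended the period below its initial price.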

In [21]:
profits = []

for key, val in prices.items():
    percentage = ((val["close"][-1] - val["open"][0]) / val["open"][0]) * 100
    profits.append((key,percentage))
    
profits = sorted(profits, key=lambda x: x[1], reverse=True)

profits[:10]

[('admp', 7483.8389225948395),
 ('adxs', 4461.111111111112),
 ('arcw', 3898.60048982856),
 ('blfs', 2799.9585720203995),
 ('amzn', 2231.928619441572),
 ('anip', 1681.8998622293218),
 ('apdn', 1549.6700659868025),
 ('cui', 1525.1625162516252),
 ('axgn', 1502.7397260273972),
 ('bcli', 1449.9225038748066)]

The most profitable stock to buy at the start of 2007 would have been ADMP, which increased by roughly 7,484% between the first day in the dataset (2007-01-03) and the last.