# Statistics from Stock Data

In this lab we will load stock data into a Pandas DataFrame and calculate some statistics on it. We will be working with stock data from Google, Apple, and Amazon. All the stock data was downloaded from Yahoo Finance in CSV format. In your workspace you should have a file named GOOG.csv containing the Google stock data, a file named AAPL.csv containing the Apple stock data, and a file named AMZN.csv containing the Amazon stock data. All the files contain 7 columns of data:

**Date, Open, High, Low, Close, Adj Close, Volume**

We will start by reading in any of the above CSV files into a DataFrame and see what the data looks like.

In [2]:
# We import pandas into Python
import pandas as pd
# We read a stock data file into a DataFrame and see what it looks like
data = pd.read_csv("GOOG.csv")

# We display the first 5 rows of the DataFrame
data.head()

,Date,Open,High,Low,Close,Adj Close,Volume
0,2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
1,2004-08-20,50.178635,54.187561,49.925285,53.80505,53.80505,23005800
2,2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
3,2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
4,2004-08-25,52.140873,53.651051,51.604362,52.657513,52.657513,9257400


We can clearly see that the DataFrame has automatically labeled the row indices using integers and has labeled the columns using the column names in the CSV file.

# To Do

You will now load the stock data from Google, Apple, and Amazon into separate DataFrames. However, for each stock you will only load the `Date` and `Adj Close` columns into the DataFrame. In addition, you want to use the `Date` column as your row index. Finally, you want the DataFrame to recognize the dates as actual dates (year/month/day) and not as strings. For each stock, you can accomplish all these things in just one line of code by using the appropriate keywords in the `pd.read_csv()` function. Here are a few hints:

* Use the `index_col` keyword to indicate which column you want to use as an index. For example `index_col = ['Open']`

* Set the `parse_dates` keyword equal to `True` so that the index column is parsed into real dates of the form year/month/day instead of strings

* Use the `usecols` keyword to select which columns you want to load into the DataFrame. For example `usecols = ['Open', 'High']`
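The three keywords above can be tried out without any file on disk. The following is a minimal sketch that reads a small in-memory CSV; the `csv_text` contents and the `sample` name are illustrative assumptions, not part of the lab data.

```python
import io
import pandas as pd

# Illustrative CSV text standing in for a file such as GOOG.csv (values are made up)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.6,51.6,47.6,49.8,49.8,1000
2004-08-20,50.1,54.1,49.9,53.8,53.8,2000
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Date",               # use the Date column as the row index
    parse_dates=True,               # parse the index into real dates
    usecols=["Date", "Adj Close"],  # load only these two columns
)

print(type(sample.index).__name__)
print(list(sample.columns))
```

Note that the index column has to be listed in `usecols` as well, otherwise it is not available to `index_col`.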

Fill in the code below:

In [10]:
# We load the Google stock data into a DataFrame
google_stock = pd.read_csv("GOOG.csv",index_col=["Date"],parse_dates=True,usecols=["Date","Adj Close"])

# We load the Apple stock data into a DataFrame
apple_stock = pd.read_csv("AAPL.csv",index_col=["Date"],parse_dates=True,usecols=["Date","Adj Close"])
                       
# We load the Amazon stock data into a DataFrame
amazon_stock = pd.read_csv("AMZN.csv",index_col=["Date"],parse_dates=True,usecols=["Date","Adj Close"])

print(google_stock.head())
print(apple_stock.head())
print(amazon_stock.head())

            Adj Close
Date                 
2004-08-19  49.845802
2004-08-20  53.805050
2004-08-23  54.346527
2004-08-24  52.096165
2004-08-25  52.657513
            Adj Close
Date                 
2000-01-03   3.596616
2000-01-04   3.293384
2000-01-05   3.341579
2000-01-06   3.052405
2000-01-07   3.196992
            Adj Close
Date                 
2000-01-03    89.3750
2000-01-04    81.9375
2000-01-05    69.7500
2000-01-06    65.5625
2000-01-07    69.5625


You can check that you have loaded the data correctly by displaying the DataFrames.

In [11]:
# We display the google_stock DataFrame
google_stock

Date,Adj Close
2004-08-19,49.845802
2004-08-20,53.805050
2004-08-23,54.346527
2004-08-24,52.096165
2004-08-25,52.657513
...,...
2017-10-09,977.000000
2017-10-10,972.599976
2017-10-11,989.250000
2017-10-12,987.830017


You will now join the three DataFrames above to create a single new DataFrame that contains the `Adj Close` values for all the stocks. Let's start by creating an empty DataFrame whose row indices are the calendar days between `2000-01-01` and `2016-12-31`. We will use the `pd.date_range()` function to create the calendar dates first, and then we will create a DataFrame that uses those dates as row indices:

In [29]:
# We create calendar dates between '2000-01-01' and  '2016-12-31'
import pandas as pd 
dates = pd.date_range('2000-01-01', '2016-12-31')

# We create an empty DataFrame that uses the above dates as indices
all_stocks = pd.DataFrame(index = dates)
all_stocks

2000-01-01
2000-01-02
2000-01-03
2000-01-04
2000-01-05
...
2016-12-27
2016-12-28
2016-12-29
2016-12-30
2016-12-31
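As a small illustration of what `pd.date_range()` produces, the sketch below uses a one-week range (an assumption chosen to keep the output short). It shows that both endpoints are included and that weekends are part of the range, which is why rows without any stock price appear once trading data is joined onto such an index.

```python
import pandas as pd

# A short, illustrative range; both endpoints are included
dates = pd.date_range("2000-01-01", "2000-01-07")

# An empty DataFrame indexed by those dates: rows, but no columns yet
frame = pd.DataFrame(index=dates)

print(len(dates))             # number of calendar days in the range
print(frame.shape)            # (rows, columns)
print(dates[0].day_name())    # 2000-01-01 falls on a weekend
```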


# To Do

You will now join the individual DataFrames, `google_stock`, `apple_stock`, and `amazon_stock`, to the `all_stocks` DataFrame. However, before you do this, you need to change the column names in each of the three DataFrames. This is because the column labels in the `all_stocks` DataFrame must be unique. Since all the columns in the individual DataFrames have the same name, `Adj Close`, we must change them to the stock name before joining them. In the space below, change the column label `Adj Close` of each individual DataFrame to the name of the corresponding stock. You can do this by using the `pd.DataFrame.rename()` method.

In [19]:
# Change the Adj Close column label to Google
google_stock = google_stock.rename(columns = {'Adj Close':'Google'})

# Change the Adj Close column label to Apple
apple_stock = apple_stock.rename(columns = {'Adj Close' : 'Apple'})

# Change the Adj Close column label to Amazon
amazon_stock =  amazon_stock.rename(columns = {'Adj Close' : 'Amazon'})

google_stock.head(), apple_stock.head(), amazon_stock.head()


(               Google
 Date                 
 2004-08-19  49.845802
 2004-08-20  53.805050
 2004-08-23  54.346527
 2004-08-24  52.096165
 2004-08-25  52.657513,
                Apple
 Date                
 2000-01-03  3.596616
 2000-01-04  3.293384
 2000-01-05  3.341579
 2000-01-06  3.052405
 2000-01-07  3.196992,
              Amazon
 Date               
 2000-01-03  89.3750
 2000-01-04  81.9375
 2000-01-05  69.7500
 2000-01-06  65.5625
 2000-01-07  69.5625)
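One detail worth keeping in mind is that `rename()` returns a new DataFrame by default and leaves the original untouched, which is why the cell above assigns the result back to the same variable. A minimal sketch with a made-up `prices` frame:

```python
import pandas as pd

# Made-up single-column frame standing in for one stock
prices = pd.DataFrame({"Adj Close": [1.0, 2.0]})

# rename() returns a renamed copy; `prices` itself keeps its old label
renamed = prices.rename(columns={"Adj Close": "Google"})

print(list(prices.columns))
print(list(renamed.columns))
```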

You can check that the column labels have been changed correctly by displaying the DataFrames.

In [21]:
# We display the google_stock DataFrame
google_stock.head()

Date,Google
2004-08-19,49.845802
2004-08-20,53.80505
2004-08-23,54.346527
2004-08-24,52.096165
2004-08-25,52.657513


In [22]:
# We display the apple_stock DataFrame
apple_stock.head()

Date,Apple
2000-01-03,3.596616
2000-01-04,3.293384
2000-01-05,3.341579
2000-01-06,3.052405
2000-01-07,3.196992


In [20]:
# We display the amazon_stock DataFrame
amazon_stock.head()

Date,Amazon
2000-01-03,89.375
2000-01-04,81.9375
2000-01-05,69.75
2000-01-06,65.5625
2000-01-07,69.5625


Now that we have unique column labels, we can join the individual DataFrames to the `all_stocks` DataFrame. For this we will use the `DataFrame.join()` method. The call `dataframe1.join(dataframe2)` joins `dataframe1` with `dataframe2`. We will join each DataFrame one by one to the `all_stocks` DataFrame. Fill in the code below to join the DataFrames; the first join has been made for you:

## Option 1: `join()` (use either option 1 or option 2, not both)

In [26]:
# We join the Google stock to all_stocks
all_stocks = all_stocks.join(google_stock)

# We join the Apple stock to all_stocks
all_stocks = all_stocks.join(apple_stock)

# We join the Amazon stock to all_stocks
all_stocks = all_stocks.join(amazon_stock)

## Option 2: `pd.concat()`


In [31]:
all_stocks=pd.concat([all_stocks,google_stock,apple_stock,amazon_stock],axis="columns")
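The two options are not fully equivalent, and a toy example makes the difference visible. In the sketch below, `calendar` and `stock` are hypothetical stand-ins for `all_stocks` and one of the stock DataFrames. By default `join()` performs a left join on the index, while `pd.concat()` along the columns performs an outer join.

```python
import pandas as pd

# Hypothetical calendar index covering three days
calendar = pd.DataFrame(index=pd.date_range("2000-01-01", "2000-01-03"))

# Hypothetical stock with one date inside the calendar and one outside it
stock = pd.DataFrame(
    {"Google": [10.0, 11.0]},
    index=pd.to_datetime(["2000-01-03", "2000-01-04"]),
)

# Left join: keeps only the dates already present in `calendar`
joined = calendar.join(stock)

# Outer join: keeps the union of both indices
combined = pd.concat([calendar, stock], axis="columns")

print(joined)
print(combined)
```

With `join()`, the row for `2000-01-04` is dropped because it is not in the calendar; with `pd.concat()` it is kept.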

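You can check that the DataFrames have been combined correctly by displaying the `all_stocks` DataFrame. The output below comes from option 2: `pd.concat()` keeps the union of the row indices, which is why dates after `2016-12-31` appear.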

In [32]:
# We display the all_stocks DataFrame
all_stocks

,Google,Apple,Amazon
2000-01-01,,,
2000-01-02,,,
2000-01-03,,3.596616,89.375000
2000-01-04,,3.293384,81.937500
2000-01-05,,3.341579,69.750000
...,...,...,...
2017-10-09,977.000000,155.839996,990.989990
2017-10-10,972.599976,155.899994,987.200012
2017-10-11,989.250000,156.550003,995.000000
2017-10-12,987.830017,156.000000,1000.929993


# To Do

Before we proceed to get some statistics on the stock data, let's first check that we don't have any *NaN* values. In the space below check if there are any *NaN* values in the `all_stocks`  dataframe. If there are any, remove any rows that have *NaN* values:

In [44]:
# Check if there are any NaN values in the all_stocks dataframe
X = all_stocks.isnull().sum() / all_stocks.shape[0]*100
print('The percentage of nulls in each column:\n', X)

The percentage of nulls in each column:
 Google    48.299001
Apple     30.165418
Amazon    30.165418
dtype: float64


In [51]:
# Remove any rows that contain NaN values
all_stocks_dopped_nulls = all_stocks.dropna(axis = 0)

You can check that the *NaN* values have been eliminated by counting them in the `all_stocks_dopped_nulls` DataFrame, which holds the result of `dropna()`:

In [52]:
# Check if there are any NaN values left after dropping rows
Y = all_stocks_dopped_nulls.isnull().sum()
Y

Google    0
Apple     0
Amazon    0
dtype: int64
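The same check-then-drop pattern can be seen on a tiny made-up frame. This sketch uses `isnull().mean() * 100`, which is an equivalent way of writing the count of nulls divided by the number of rows; the `frame` values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column, in different rows
frame = pd.DataFrame(
    {"Google": [np.nan, 2.0, 3.0], "Apple": [1.0, np.nan, 3.0]}
)

# Percentage of missing values per column
null_pct = frame.isnull().mean() * 100

# Drop every row that contains at least one NaN; `frame` itself is unchanged
cleaned = frame.dropna(axis=0)

print(null_pct)
print(cleaned)
```

Only the last row survives, because `dropna(axis=0)` removes a row as soon as any of its columns is missing.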

Display the `all_stocks_dopped_nulls` DataFrame and verify that there are no *NaN* values:

In [53]:
# We display the DataFrame with the NaN rows removed
all_stocks_dopped_nulls

,Google,Apple,Amazon
2004-08-19,49.845802,1.973460,38.630001
2004-08-20,53.805050,1.979244,39.509998
2004-08-23,54.346527,1.997236,39.450001
2004-08-24,52.096165,2.053144,39.049999
2004-08-25,52.657513,2.123831,40.299999
...,...,...,...
2017-10-09,977.000000,155.839996,990.989990
2017-10-10,972.599976,155.899994,987.200012
2017-10-11,989.250000,156.550003,995.000000
2017-10-12,987.830017,156.000000,1000.929993


Now that you have eliminated the *NaN* values, we can calculate some basic statistics on the stock prices. Fill in the code below:

In [67]:
# Print the average stock price for each stock
average = all_stocks_dopped_nulls.mean(numeric_only = True)
print('The average of each stock: \n', average)

# Print the median stock price for each stock
med = all_stocks_dopped_nulls.median(numeric_only = True)
print('\nThe median of each stock: \n', med)

# Print the standard deviation of the stock price for each stock
standardDev = all_stocks_dopped_nulls.std(numeric_only = True)
print('\nThe std of each stock: \n', standardDev)

# Print the correlation between stocks
print('\nThe correlation between the three stocks: \n', all_stocks_dopped_nulls.corr())

The median of each stock: 
 Google    293.029114
Apple      43.554916
Amazon    179.460007
dtype: float64

The std of each stock: 
 Google    223.853780
Apple      43.068747
Amazon    256.779834
dtype: float64

The correlation between the three stocks: 
           Google     Apple    Amazon
Google  1.000000  0.929359  0.965853
Apple   0.929359  1.000000  0.915749
Amazon  0.965853  0.915749  1.000000
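To see what each of these statistics measures, it can help to run them on a tiny frame where the answers are easy to verify by hand. The `toy` frame below is a made-up assumption: column `B` is exactly twice `A`, and `C` is `A` reversed.

```python
import pandas as pd

# Made-up data: B rises in lockstep with A, C falls as A rises
toy = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [2.0, 4.0, 6.0, 8.0],
        "C": [4.0, 3.0, 2.0, 1.0],
    }
)

means = toy.mean()      # column-wise average
medians = toy.median()  # column-wise median
stds = toy.std()        # sample standard deviation (ddof=1 by default)
corr = toy.corr()       # pairwise Pearson correlation between columns

print(means)
print(medians)
print(stds)
print(corr)
```

The correlation matrix is a separate object from the prices: averaging its columns yields an average correlation, not an average price.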
