# Data Extraction

In [85]:
start = '2018-07-01'
end = '2021-07-01'
# We start by stating the range of dates we want to look at.
# Here the range runs from 1 July 2018 to 1 July 2021.

# The dates are in quotation marks ('' or "" are both usable) because
# they are strings here, but they will be interpreted as dates when we pass
# them to modules/packages.
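
To make "interpreted as dates" concrete, here is a minimal sketch using `pd.to_datetime`, which is one way pandas turns such strings into timestamps. Whether a particular data-reading package uses this exact call internally is an assumption; the sketch only shows that strings in the 'YYYY-MM-DD' form parse cleanly.

```python
import pandas as pd

start = '2018-07-01'
end = '2021-07-01'

# Convert the strings into pandas Timestamp objects.
start_ts = pd.to_datetime(start)
end_ts = pd.to_datetime(end)

print(type(start).__name__, type(start_ts).__name__)
print(start_ts.year, start_ts.month, start_ts.day)

# Once they are timestamps, we can do date arithmetic on them.
span = end_ts - start_ts
print(span.days)
```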

Now that we have a very basic setup, we can get the data of the stocks.
<br>
Again, there are a few ways of doing this.
<br>
We'll cover the two ways I'm familiar with, though there may well be others.

First, we'll start with the more blunt method, and then we'll bring in a bit more Python so you can see how to shorten the code to a few lines and make it flexible enough to change without manually editing each line.

In [1]:
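import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# The cells below call pdr.DataReader, so pdr has to be imported as well.
# NOTE: pdr is assumed here to be an alias for the pandas-datareader package.
import pandas_datareader as pdr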

## The Longer Method

In [78]:
stock1 = pdr.DataReader('AAPL', 'yahoo', start, end)
# So we first name our variable stock1.

# Next we say that its value comes from the module we imported as pdr.

    # From the pdr module, we use the DataReader function.
    
        # NOTE: Where certain characters are capitalised, they must be 
        # written in that exact format, otherwise it won't work.
        
    # The DataReader function is the most commonly used, and I've not used
    # the pdr module for anything apart from the DataReader function.
    
        # There are 4 parameters we should specify when we use this function.
            # You can click anywhere inside the function's brackets and hit
            # SHIFT+TAB, and it'll show you all the parameters that the
            # function takes and what their defaults are.
            
            # For example, if we had not stated the start and end, then
            # the function would fall back on its own default date range.
            # That may give us more data, but you want all the data to be
            # from the same period to make a decent comparison.
        
        # The first parameter we define is the ticker symbol of the stock we 
        # want to look at.
            # The downside to this is that we have to know the ticker symbol
            # rather than just the name of the company we want to look at.
                # It also has to be the symbol as it appears on Yahoo Finance,
                # otherwise it won't work.
                
        # The next parameter we enter is 'yahoo' to specify where we want to look
        # for the data.
            # The reason we do this is that the function can read from more than
            # one source. It used to work for Google too, but that source has
            # been deprecated.
                # NOTE - It's crucial we write it as yahoo (without capitalisation).
                
        # Then we specify the start and end parameters.
        
        # Each parameter is separated by a comma.
        
stock1
# The line above simply displays the variable.

Date,High,Low,Open,Close,Volume,Adj Close
2019-07-01,51.122501,50.162498,50.792500,50.387501,109012000.0,49.314018
2019-07-02,50.782501,50.340000,50.352501,50.682499,67740800.0,49.602734
2019-07-03,51.110001,50.672501,50.820000,51.102501,45448000.0,50.013775
2019-07-05,51.270000,50.724998,50.837502,51.057499,69062000.0,49.969734
2019-07-08,50.349998,49.602501,50.202499,50.005001,101354400.0,48.939667
...,...,...,...,...,...,...
2021-06-25,133.889999,132.809998,133.460007,133.110001,70783700.0,132.717178
2021-06-28,135.250000,133.350006,133.410004,134.779999,62111300.0,134.382248
2021-06-29,136.490005,134.350006,134.800003,136.330002,64556100.0,135.927673
2021-06-30,137.410004,135.869995,136.169998,136.960007,63261400.0,136.555817


Notice how when we get the values, the results are printed out in a table. 
<br>
This format is called a **dataframe** which comes from pandas.
<br>
Since the data is held in a pandas dataframe, we can perform some __slicing.__

Now say we just wanted the Adjusted Close for each date.

In [79]:
stock1 = stock1['Adj Close']
# We are saying that stock1 is now equal to the Adj Close column of stock1.
# This means all the other columns (High, Low, Open, Close, Volume) are lost,
# and the Adjusted Close (Adj Close) is all that remains (along with the index
# column) from stock1.

# This operation is called column slicing.

# We could instead have picked another column, such as Close or Open or Volume, etc.
# Feel free to try.

# Notice how if you run this line of code again you will get an error.
# This is because the stock1 variable is now simply a series, and no longer a table
# with all the different columns.
# Therefore, if we run it again, it cannot find a column with that name.
# When we column slice, the column name is kept only as the name of the series
# (see Name: Adj Close in the output), not as a column we can select.

stock1

Date
2019-07-01     49.314018
2019-07-02     49.602734
2019-07-03     50.013775
2019-07-05     49.969734
2019-07-08     48.939667
                 ...    
2021-06-25    132.717178
2021-06-28    134.382248
2021-06-29    135.927673
2021-06-30    136.555817
2021-07-01    136.864899
Name: Adj Close, Length: 506, dtype: float64
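
The behaviour described in the comments above can be checked without downloading anything. The sketch below uses a tiny dataframe of made-up prices (not real market data) to show that single brackets return a series, that selecting the same column a second time raises a `KeyError`, and that double brackets keep a one-column dataframe.

```python
import pandas as pd

# Made-up prices purely for illustration (not real market data).
df = pd.DataFrame(
    {'Close': [10.0, 11.0, 12.0], 'Adj Close': [9.5, 10.5, 11.5]},
    index=pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06']),
)

adj = df['Adj Close']  # single brackets give a Series
print(type(adj).__name__, adj.name)

# Selecting 'Adj Close' again fails: the Series is indexed by dates,
# and there is no label called 'Adj Close' in that index.
try:
    adj['Adj Close']
    second_selection_failed = False
except KeyError:
    second_selection_failed = True
print(second_selection_failed)

# Double brackets keep the result as a one-column DataFrame instead.
adj_frame = df[['Adj Close']]
print(type(adj_frame).__name__, list(adj_frame.columns))
```

If you want to be able to re-run a cell safely, assigning the slice to a new variable name (rather than overwriting `stock1`) is one simple option.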

Now say we wanted another stock to look at. We'll do the same thing, but we'll slice immediately rather than making it a separate step.

In [46]:
stock2 = pdr.DataReader('TSLA', 'yahoo', start, end)['Adj Close']
# Notice here how we have sliced at the end of the line, instead of writing another
# line of code.
stock2

Now let's say we wanted the two datasets side by side.
<br>
The easiest way I've found of doing this is to build the dataframe from a dictionary.

In [55]:
data = pd.DataFrame({
    'Apple': stock1,
    'Tesla': stock2
})
# Here we have created a variable called data that we want in the form of a table
# (dataframe).
# We do this by first stating pd.DataFrame().

# We then pass in a dictionary, which is written with curly brackets {}.
    # When we have pd.DataFrame({}) and our cursor is between {}, if we hit enter
    # the notebook editor automatically goes onto the next line and indents.
    
# We specify the names of the columns as the keys of the dictionary ('Apple' and 'Tesla').
    # NOTE: We write these as strings.
# We then specify the values for these keys: stock1, which holds AAPL, and stock2,
# which holds TSLA.

# NOTE: We could have written this instead and it would have worked the same:
    
    # data = pd.DataFrame({'Apple':stock1, 'Tesla':stock2})
    
    # This is fine, but it's just easier to read when it is spread over several lines.

import pandas
data = pandas.DataFrame({
    'Apple': stock1,
    'Tesla': stock2
})
# Here we simply show that had we imported pandas without the as pd then it would've worked
# the same, but we would've had to write out pandas instead of just pd.
    # The import earlier on was:
        # import pandas as pd

# We'll go back to using pandas as pd.

import pandas as pd
data

Date,Apple,Tesla
2018-07-02,45.111362,67.014000
2018-07-03,44.325687,62.172001
2018-07-05,44.682377,61.832001
2018-07-06,45.301765,61.779999
2018-07-09,45.930782,63.702000
...,...,...
2021-06-25,132.717178,671.869995
2021-06-28,134.382248,688.719971
2021-06-29,135.927673,680.760010
2021-06-30,136.555817,679.700012
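
One detail worth knowing when building a dataframe from a dictionary of series: pandas lines the rows up by their index (here, the dates). The sketch below uses made-up values, with one date deliberately missing from the second series, to show how a gap appears as `NaN` rather than shifting the rows.

```python
import pandas as pd

# Made-up values; the second series is deliberately missing one date.
a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06']),
)
b = pd.Series(
    [10.0, 30.0],
    index=pd.to_datetime(['2021-01-04', '2021-01-06']),
)

table = pd.DataFrame({'A': a, 'B': b})
print(table)

# Rows are matched on the date index; the gap in b shows up as NaN.
missing = int(table['B'].isna().sum())
print(missing)
```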


In [84]:
# To summarise our steps:

start = '2018-07-01'
end = '2021-07-01'

stock1 = pdr.DataReader('AAPL', 'yahoo', start, end)['Adj Close']
stock2 = pdr.DataReader('TSLA', 'yahoo', start, end)['Adj Close']
data = pd.DataFrame({
    'Apple': stock1,
    'Tesla': stock2
})
data

# Obviously we can shorten this and make this much more flexible as shown below.

Date,Apple,Tesla
2018-07-02,45.111370,67.014000
2018-07-03,44.325695,62.172001
2018-07-05,44.682369,61.832001
2018-07-06,45.301762,61.779999
2018-07-09,45.930782,63.702000
...,...,...
2021-06-25,132.717178,671.869995
2021-06-28,134.382233,688.719971
2021-06-29,135.927658,680.760010
2021-06-30,136.555817,679.700012


## The Shorter Method

In [60]:
n = int(input('Please input the number of tickers you want to look at: '))
# Let's break the code down, starting from the left.

# n is just the variable we have set, and it will be equal to whatever number we enter.

# The int() function means integer.
    # So this means that we want the value inside to be an integer.
    # So if we had n = int(3.9) and we returned the n value, we would get 3.
    # Notice that it does not round the value up; it simply drops the decimal part.
    
# The input() function means we want to type something into Python.
    # The result of the input will always be a string, so if we had entered 5 then
    # the resulting value would be '5', interpreted as a string (text) and not
    # as a number.
    
    # This is why we wrap it in the int() function, because we want our n variable
    # to be a number.
    
    # In the input() function, you can enter a string of words that will show when the 
    # input() function is run.
    # I've just written the above, but feel free to change it and play around with it.
    
# A related pattern, which is often used by other programmers, is the following:

# numbers = list(map(int, input('Please input some whole numbers separated by spaces: ').split()))

# NOTE: map(int, input(...)) on its own is not the same as int(input(...)). map applies int to each
# item it is given and returns a map object rather than a single number, so it is used when several
# values are entered at once.
# You can try getting rid of the hashtag # (and the space after the hashtag too, otherwise you'll get
# an error) to run it.
# We'll see below when this is required, and why we can't always use int(input()).

Please input the number of tickers you want to look at:  3
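
The conversions described above can be tried without typing anything in. In this minimal sketch, the two strings are stand-ins for what `input()` would return (they are assumptions for illustration, not values from the notebook).

```python
# Stand-ins for what input() would return; input() always gives back a string.
typed_count = '3'
typed_numbers = '12 7 30'

n = int(typed_count)   # the string '3' becomes the integer 3
truncated = int(3.9)   # floats are truncated toward zero, not rounded

# A string that looks like a decimal number cannot be converted directly.
try:
    int('3.9')
    float_string_ok = True
except ValueError:
    float_string_ok = False

# map(int, ...) applies int to each piece, giving several integers at once.
numbers = list(map(int, typed_numbers.split()))

print(n, truncated, float_string_ok, numbers)
```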


In [65]:
tickers = []
# We start off by creating an empty list of tickers (useful if you're uncertain about what tickers
# you want to use, or you want to look at different tickers each time).

# We start off with an empty list because we want to add values to it as we go along.

# If you know the tickers you're looking for, then you can simply enter them into a list so we can 
# iterate through them.
    # For example:
        # ['AAPL', 'TSLA']

for i in range(0, n):
# We start with a for loop using range from 0 to n, the value that we specified earlier on.
# We start with 0 because that will be the first value, and the nth value is not included.
# So if our n = 3, then range(0, n) gives the values 0, 1, 2, while range(1, n) gives
# just 1, 2 and not 1, 2, 3.

    # NOTE: Our n has to be an integer, otherwise we cannot perform the for loop, i.e. 3.4
    # loops would not make sense.
    
# We use 'i' as that's just convention, but we can use whatever name we want, as long as we
# use the same name in the loop body below.

# We could dive into more complicated loops using throwaway variables _ (underscores), but we can
# touch on that later if there's the need.
    
    print(i)
    # I have added this line to show what the value of i is at each stage.
    
    print('This is loop ' + str(i))
    # This is how we would print strings and numbers together - the numbers have to become strings
    # via the string function - str()
        # Think back to the integer function int(), this works in the exact same way except we change
        # the contents into a string instead of an integer.

    ticker = input(f'Please input ticker {i+1}: ')
    # Now each time we go through the loop, this input function will run.
    # We have also used f-strings here - f'{}'.
    # The contents inside the curly brackets {} are evaluated as normal Python code, and the
    # result is placed into the string.
    # Another example of f-strings will be shown below, but this is how we can use them and they're
    # very handy.
    
    tickers.append(ticker.upper())
    # Now that we have entered our ticker, we add it to the empty list we prepared,
    # using the append() method for lists.
    # The list comes first (tickers.append) because append() only works on lists.
    # The value we want to add to the list then goes inside the brackets.
    
    # We also call ticker.upper(), which capitalises every letter in the string.
    # So if we entered aapl, upper() will return AAPL.
        # NOTE: This is called on the string variable and not the list variable; it is called
        # at the end, and the upper() function does not take any parameters.
        
tickers
# Now we see what we have created.

0
This is loop 0


Please input ticker 1:  AAPL


1
This is loop 1


Please input ticker 2:  TSLA


2
This is loop 2


Please input ticker 3:  FB


['AAPL', 'TSLA', 'FB']
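
The pieces used in the loop above (`range`, f-strings, `upper()` and `append()`) can be seen together in a version that does not wait for keyboard input. The `typed` list is a stand-in for the answers given at the prompts and is only an assumption for illustration.

```python
# Stand-in for three answers typed at the input() prompt.
typed = ['aapl', 'Tsla', 'fb']

n = len(typed)
print(list(range(0, n)))  # the stop value n itself is excluded
print(list(range(1, n)))  # starting from 1 gives one value fewer

tickers = []
prompts = []
for i in range(0, n):
    # The expression inside {} is evaluated, so i + 1 appears as a number in the text.
    prompts.append(f'Please input ticker {i + 1}: ')
    # upper() is called on the string; append() is called on the list.
    tickers.append(typed[i].upper())

print(prompts)
print(tickers)
```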

Now that we have our tickers, we get our data.

In [66]:
data = pd.DataFrame()
# Now instead of an empty list, we want an empty dataframe, since we know that we want a table to compare the data.
# We do not have to specify whether it will be built from a dictionary or a series; all that matters is that the 
# data variable is a pandas dataframe.

for ticker in tickers:
# We use ticker and tickers here (again - just convention).
# We could've used whatever.

    print(ticker)
    # This shows what ticker is being looked at for each iteration / loop.
    
    data[ticker] = pdr.DataReader(ticker, 'yahoo', start, end)['Adj Close']
    # This is exactly the same as what we did previously, but much quicker in a for loop.
    # Assigning to data[ticker] creates a new column of data on each loop.
data

AAPL
TSLA
FB


Date,AAPL,TSLA,FB
2018-07-02,45.111362,67.014000,197.360001
2018-07-03,44.325687,62.172001,192.729996
2018-07-05,44.682381,61.832001,198.449997
2018-07-06,45.301762,61.779999,203.229996
2018-07-09,45.930782,63.702000,204.740005
...,...,...,...
2021-06-25,132.717178,671.869995,341.369995
2021-06-28,134.382233,688.719971,355.640015
2021-06-29,135.927658,680.760010,351.890015
2021-06-30,136.555817,679.700012,347.709991
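
To experiment with the column-by-column loop without an internet connection, the download can be swapped for a stand-in. In the sketch below, `fake_prices` is a hypothetical helper that returns made-up numbers; it is not part of pandas or of any data-reading package. The sketch also shows that a dictionary comprehension builds the same table in one step.

```python
import pandas as pd

dates = pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-06'])


def fake_prices(ticker):
    """Hypothetical stand-in for a real download; returns made-up numbers."""
    base = float(len(ticker))  # arbitrary starting value derived from the ticker text
    return pd.Series([base, base + 1, base + 2], index=dates, name='Adj Close')


tickers = ['AAPL', 'TSLA', 'FB']

# Column by column, as in the loop above.
data = pd.DataFrame()
for ticker in tickers:
    data[ticker] = fake_prices(ticker)

# The same table built in one step from a dictionary comprehension.
data_alt = pd.DataFrame({ticker: fake_prices(ticker) for ticker in tickers})

same_columns = list(data.columns) == list(data_alt.columns)
same_values = bool((data.to_numpy() == data_alt.to_numpy()).all())
print(same_columns, same_values)
```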


In [71]:
# In summary, this method may appear "longer", but it offers a lot more flexibility in choosing the
# equities that we want to look at, rather than changing them at each step:

n = int(input('Please input the number of tickers you want to look at: '))
tickers = []
for i in range(0, n):
    ticker = input(f'Please input ticker {i+1}: ')
    tickers.append(ticker.upper())
    
# If we know what tickers we are looking for, we can just have:
# tickers = ['AAPL', 'TSLA']
data = pd.DataFrame()
for ticker in tickers:
    data[ticker] = pdr.DataReader(ticker, 'yahoo', start, end)['Adj Close']
data

Please input the number of tickers you want to look at:  2
Please input ticker 1:  tsla
Please input ticker 2:  aapl


Date,TSLA,AAPL
2018-07-02,67.014000,45.111359
2018-07-03,62.172001,44.325684
2018-07-05,61.832001,44.682381
2018-07-06,61.779999,45.301754
2018-07-09,63.702000,45.930786
...,...,...
2021-06-25,671.869995,132.717178
2021-06-28,688.719971,134.382233
2021-06-29,680.760010,135.927658
2021-06-30,679.700012,136.555817
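
A possible variation, sketched here as an assumption rather than as the method used above, is to ask for all the tickers in a single prompt and split the answer, which removes the need to ask for the number of tickers first. The `typed` string stands in for what `input()` would return.

```python
# Stand-in for a single answer typed at an input() prompt.
typed = 'tsla, aapl ,fb'

# Split on commas, trim stray spaces, and upper-case each symbol.
tickers = [part.strip().upper() for part in typed.split(',') if part.strip()]
print(tickers)
```

The resulting list can then be fed to the same `for ticker in tickers:` loop as before.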
