# Loading Data
Now that we are more familiar with the NumPy and Pandas packages, we can start loading data into Python to work with. There are several ways to get data into our notebook. The four that we'll look at in this notebook are:
1. Generating data within our notebook using `np.arange()`
2. Generating data within our notebook using `np.random` functions
3. Importing external data into our notebook using Pandas DataReader
4. Importing an external .csv file into our notebook

## Generating Data with NumPy arange

In the last chapter, we created NumPy arrays by typing out every value in the array. NumPy can also generate a range of values for us. Let's first look at the `arange()` function, which creates evenly spaced values within an interval. The value 30 that we put inside the parentheses is the end of the interval, and it is excluded. This means that Array_3 will start at 0 and end at 29.

In [20]:
import numpy as np
Array_3 = np.arange(30)
Array_3

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

We can also provide a starting value, in this case 1, and an ending value, in this case 101. This array starts at 1 and ends before 101, so its last value is 100.

In [12]:
np.arange(1, 101)

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100])

Another piece of information we can add is a step. For example, let's create an array that starts at 1 and ends before 10 with a step of 2, which means the range takes every other number. Again, the array goes up to but does not include the ending value.

In [15]:
# arange(starting value, ending value, step)
np.arange(1, 10, 2)

array([1, 3, 5, 7, 9])
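The start, end, and step arguments behave much like Python's built-in `range()`. The short sketch below compares the two and also shows a negative step; the variable names are only illustrative.

```python
import numpy as np

# arange(start, stop, step): the stop value is excluded, as with range()
odds = np.arange(1, 10, 2)
same_as_range = np.array(list(range(1, 10, 2)))
print(odds)
print(np.array_equal(odds, same_as_range))

# A negative step counts down; the stop value (0) is still excluded
countdown = np.arange(10, 0, -2)
print(countdown)
```
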

The last two examples are arrays that consist of a range of integers. We can also generate a range of float values within an interval. For this purpose we'll use the `linspace()` function. This function takes a starting value, an ending value, and the number of items in the array. Unlike `arange()`, it includes the ending value by default. This example will return 20 values evenly spaced between 0 and 5.

In [19]:
# 20 evenly spaced floats from 0 to 5, with both endpoints included
np.linspace(0, 5, 20)

array([0.        , 0.26315789, 0.52631579, 0.78947368, 1.05263158,
       1.31578947, 1.57894737, 1.84210526, 2.10526316, 2.36842105,
       2.63157895, 2.89473684, 3.15789474, 3.42105263, 3.68421053,
       3.94736842, 4.21052632, 4.47368421, 4.73684211, 5.        ])
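To see where the spacing of 0.26315789 comes from, we can ask `linspace()` to return the step it used. This is a small sketch using the `retstep` and `endpoint` keyword arguments: 20 points leave 19 gaps, so the step is 5/19.

```python
import numpy as np

# retstep=True returns the array together with the spacing between values
values, step = np.linspace(0, 5, 20, retstep=True)
print(len(values), step)

# With endpoint=False the ending value is left out, as in arange()
no_end = np.linspace(0, 5, 20, endpoint=False)
print(no_end[-1])
```
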

An array created with `arange()` can be indexed in the same way as lists and other NumPy arrays. Let's use Array_3 that we created earlier as an example. To get the 5th value within the array, we put index 4 in the square brackets, because indexing starts at 0.

In [26]:
Array_3[4]

4
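A few more indexing patterns work the same way on this array. The sketch below recreates Array_3 so that it runs on its own, then uses a positive index, negative indexes that count from the end, and slices.

```python
import numpy as np

Array_3 = np.arange(30)

fifth = Array_3[4]          # positive indexes count from the front, starting at 0
same_fifth = Array_3[-26]   # negative indexes count back from the end (30 - 26 = 4)
last = Array_3[-1]          # the final element
first_five = Array_3[:5]    # a slice: indexes 0 through 4
every_tenth = Array_3[::10] # a slice with a step of 10

print(fifth, same_fifth, last)
print(first_five)
print(every_tenth)
```
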

## Generating Data with NumPy random

There are a lot of functions that allow us to generate randomized values in a NumPy array. Let's start with the `random.rand` function. It generates random floats from 0 up to, but not including, 1. In the parentheses we put how many numbers we want to generate. In this example we generate 10 random floats.

In [36]:
# creating an array of random floats
np.random.rand(10)

array([0.22906559, 0.54370096, 0.9682888 , 0.33277043, 0.46902272,
       0.47291913, 0.40313893, 0.09183932, 0.24620293, 0.25812439])

Another function we can use is the `random.randint` function. It returns a random integer within a range. For example, we can ask for a random whole number between 10 and 20. It is important to note that the lower bound is included and the upper bound is not, so the possible values run from 10 up to 19.

In [9]:
# gathering a random integer within a range
np.random.randint(10, 20)

12

If we add a third number at the end, we can generate several random integers within this range. This example creates an array of 10 integers between 10 and 19. The random functions may return repeated values.

In [41]:
# gathering an array of random integers within a range
np.random.randint(10, 20, 10)

array([12, 17, 14, 12, 11, 14, 12, 16, 13, 16])
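One way to convince ourselves of the bounds is to draw many values and check the smallest and largest. This is only a sketch: the sample size of 1000 is an arbitrary choice, and the third argument can also be passed by name as `size`.

```python
import numpy as np

# Draw 1000 integers; the low end (10) is included, the high end (20) is not
samples = np.random.randint(10, 20, 1000)
print(samples.min(), samples.max())

# The number of values can also be passed by keyword
five = np.random.randint(10, 20, size=5)
print(five)
```
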

We can also create a random matrix by providing a shape. For example, let's create an array of integers between 1 and 99, arranged as a matrix with 6 rows and 8 columns.

In [43]:
np.random.randint(1, 100, (6, 8))

array([[95, 14,  6, 47,  2, 70, 86, 22],
       [26, 67, 97,  9, 74,  3, 37, 28],
       [52, 75, 45, 96, 71, 84, 16,  5],
       [30, 58,  4, 53, 24, 63, 94, 94],
       [45, 13,  1, 54, 98, 66, 30, 85],
       [89, 43, 85, 19, 22, 32, 92, 47]])
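The shape is given as a (rows, columns) tuple. As a smaller sketch, the block below builds a 2-by-4 integer matrix and checks its shape. It also shows that `random.rand` takes its dimensions as separate arguments rather than as a tuple.

```python
import numpy as np

# randint takes the shape as a tuple: (rows, columns)
matrix = np.random.randint(1, 100, (2, 4))
print(matrix.shape)

# rand takes each dimension as its own argument
floats = np.random.rand(2, 4)
print(floats.shape)
```
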

Since all the numbers are randomly generated, if we run any of these functions again we're likely to get completely different values. If we want to generate random numbers that stay the same every time a model is run, we can use the seed function.

The seed function sets NumPy's random number generator to a fixed starting state, so the same sequence of "random" values is produced on every run. The values still look random, but they are repeatable. We do this to ensure reproducibility. In production, where randomization helps in forecasting, you could leave the seed out.

In [53]:
# fixing the randomizer to get the same random values for reproduction
np.random.seed(10)
np.random.rand(6)

array([0.77132064, 0.02075195, 0.63364823, 0.74880388, 0.49850701,
       0.22479665])

If we remove the seed line, we'll get 6 different values every time we run the code.

In [56]:
# without the seed
np.random.rand(6)

array([0.91777412, 0.71457578, 0.54254437, 0.14217005, 0.37334076,
       0.67413362])

If we run the same code again, adding back the seed, we get the same result.

In [63]:
np.random.seed(10)
np.random.rand(6)

array([0.77132064, 0.02075195, 0.63364823, 0.74880388, 0.49850701,
       0.22479665])
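As an alternative sketch, NumPy also offers `np.random.default_rng()`, which returns a generator object that carries its own seed instead of setting a global one. This is not the approach used in the cells above, and it produces different numbers from `np.random.seed(10)`, but two generators created with the same seed produce the same values as each other.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(10)
rng_b = np.random.default_rng(10)

first = rng_a.random(6)
second = rng_b.random(6)
print(np.array_equal(first, second))

# Generators have their own integer method; the high end is excluded by default
ints = rng_a.integers(10, 20, size=5)
print(ints)
```
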

## Importing External Data with Pandas DataReader

Rather than defining our data manually, we can use packages that work alongside Pandas to connect to external data sources and import the data into our notebook.

Functions from `pandas_datareader.data` extract external data from the web into a Pandas DataFrame. We'll look at an example of using datareader to import financial data from Naver.

In [1]:
# Import Packages
import pandas as pd
import pandas_datareader.data as pdr

# The first argument is the specific stock or index we want to retrieve data for
# 005930 is the ticker for Samsung Electronics, which is listed on the KOSPI
# The second argument is the data source we want to connect to - in our example it's naver
# We also need to define the start and end dates for the days of information
# that we want to import into our DataFrame

pdr.DataReader('005930', 'naver', '2023-07-01', '2023-07-31')

Date,Open,High,Low,Close,Volume
2023-07-03,72700,73200,72600,73000,10722181
2023-07-04,73400,73600,72900,73000,10214350
2023-07-05,73000,73300,71900,72000,12310610
2023-07-06,71900,72400,71500,71600,14777667
2023-07-07,71100,71400,69800,69900,17308877
2023-07-10,70000,70400,69200,69500,11713926
2023-07-11,70200,71500,70100,71500,12177392
2023-07-12,71200,72000,71100,71900,10375581
2023-07-13,72400,72600,71900,71900,14417279
2023-07-14,72500,73400,72400,73400,15882519


We can also use functions from the `yfinance` package to connect to Yahoo Finance data for a particular ticker, in this case `DAX`, an ETF that tracks Germany's DAX stock market index. This time we define our arguments outside of the function, which makes them easier to find and change in the future. We also store the resulting DataFrame for future use in our notebook.

In [2]:
import yfinance as yf

stock = 'DAX'
startdate = '2023-07-01'
enddate = '2023-07-31'

stock_df = yf.download(stock, start=startdate, end=enddate)
print(stock_df)

[*********************100%***********************]  1 of 1 completed
                 Open       High        Low      Close  Adj Close  Volume
Date                                                                     
2023-07-03  29.520000  29.525000  29.420000  29.525000  29.525000    2400
2023-07-05  29.280001  29.280001  28.940001  29.059999  29.059999    6100
2023-07-06  28.629999  28.629999  28.275000  28.459999  28.459999   43000
2023-07-07  28.584999  28.840000  28.584999  28.830000  28.830000    3600
2023-07-10  28.820000  29.014000  28.790001  28.930000  28.930000   22300
2023-07-11  29.090000  29.250000  29.080000  29.240000  29.240000    7700
2023-07-12  29.770000  30.011000  29.770000  30.000000  30.000000    6400
2023-07-13  30.389999  30.459999  30.350000  30.459999  30.459999    4900
2023-07-14  30.309999  30.309999  30.309999  30.309999  30.309999       0
2023-07-17  30.170000  30.430000  30.120001  30.430000  30.430000    2900
2023-07-18  30.299999  30.410000  30.290001

In [11]:
stock = 'BMW.DE'
startdate = '2023-03-01'
enddate = '2023-03-19'

stock_df = yf.download(stock, start=startdate, end=enddate)

print(stock_df)

[*********************100%***********************]  1 of 1 completed
                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2023-03-01   98.290001  100.620003   98.220001   99.339996   99.339996   
2023-03-02   98.889999   99.349998   97.589996   99.349998   99.349998   
2023-03-03   99.500000  102.019997   99.349998  101.480003  101.480003   
2023-03-06  101.419998  101.900002  100.959999  101.459999  101.459999   
2023-03-07  101.400002  101.699997  100.800003  101.000000  101.000000   
2023-03-08  100.760002  101.680000  100.419998  100.959999  100.959999   
2023-03-09  102.180000  103.699997   97.250000   99.529999   99.529999   
2023-03-10   96.040001   98.330002   96.000000   98.330002   98.330002   
2023-03-13   97.839996   98.080002   93.769997   94.639999   94.639999   
2023-03-14   94.650002   95.870003   93.300003   95.599998   95.599998   
2023-03-15   96.110001   97.199997   92.540

In [81]:
stock = 'VOW.DE'
startdate = '2023-07-01'
enddate = '2023-07-31'

stock_DataFrame = yf.download(stock, start = startdate, end = enddate)
print(stock_DataFrame)

[*********************100%***********************]  1 of 1 completed
                  Open        High         Low       Close   Adj Close  Volume
Date                                                                          
2023-07-03  154.199997  156.399994  153.899994  154.449997  154.449997   28799
2023-07-04  154.100006  155.000000  152.500000  153.649994  153.649994   14301
2023-07-05  152.500000  156.199997  152.500000  156.199997  156.199997   47219
2023-07-06  154.800003  154.899994  151.050003  151.750000  151.750000   26972
2023-07-07  152.199997  153.250000  150.500000  152.949997  152.949997   18302
2023-07-10  153.050003  154.100006  151.500000  152.550003  152.550003   22304
2023-07-11  153.300003  154.449997  151.050003  154.149994  154.149994   30722
2023-07-12  153.699997  155.000000  153.300003  154.699997  154.699997   29594
2023-07-13  154.100006  155.699997  153.550003  154.300003  154.300003   23604
2023-07-14  153.750000  154.000000  151.000000  151.850006  15
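The object returned by `yf.download` is an ordinary Pandas DataFrame indexed by date, so the usual column and index operations apply. The sketch below does not make a live download. Instead it rebuilds the first three rows of the VOW.DE output above by hand, and the name `sample_df` is only illustrative.

```python
import pandas as pd

# First three rows of the VOW.DE output, typed in by hand for illustration
dates = pd.to_datetime(['2023-07-03', '2023-07-04', '2023-07-05'])
sample_df = pd.DataFrame(
    {
        'Open': [154.199997, 154.100006, 152.500000],
        'High': [156.399994, 155.000000, 156.199997],
        'Low': [153.899994, 152.500000, 152.500000],
        'Close': [154.449997, 153.649994, 156.199997],
    },
    index=pd.Index(dates, name='Date'),
)

# Difference between the daily high and low
daily_range = sample_df['High'] - sample_df['Low']
print(daily_range)

# Date with the highest closing price among these rows
best_day = sample_df['Close'].idxmax()
print(best_day)
```
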

## Importing External .csv

One of the more powerful ways of creating a DataFrame is to import a .csv file. This saves a tremendous amount of time compared with recreating a data source by hand in Python. We can use the `pd.read_csv()` function to do this.

When using this function, either the relative file path or the complete file path must be given between quotation marks. For a relative file path, if the .csv file is in the same working directory as the notebook, you can simply give the file name with the .csv suffix.

If the file is in another folder, then you must include the folders that pandas should follow to reach the file (e.g. 'datafolder/Data Source.csv').

In [20]:
# Importing an external .csv using a relative file path
dataframe_3 = pd.read_csv('Data Source.csv')

dataframe_3

Unnamed: 0,Store,Date,CategoryID,ProductID,Q1,Q2,Q3,Q4
0,1,3/10/2019 0:00:00,1,11XXRP-P,24924.50,46039.49,41595.55,
1,1,6/4/2019 0:00:00,2,12XXRP-Q,50605.27,44682.74,47928.89,
2,1,1/16/2019 0:00:00,3,13XXRP,13740.12,10887.84,11523.47,
3,1,10/18/2019 0:00:00,4,14XXRP,39954.04,d35351.21,36826.95,
4,1,4/11/2019 0:00:00,5,15XXRP-A,32229.38,29620.81,26468.27,
...,...,...,...,...,...,...,...,...
1882,40,3/24/2019 0:00:00,44,4044XXRP,3280.00,3556.22,2965.33,3041.41
1883,40,3/24/2019 0:00:00,44,4044XXRP,3280.00,3556.22,2965.33,3041.41
1884,40,3/24/2019 0:00:00,44,4044XXRP,3280.00,3556.22,2965.33,3041.41
1885,40,9/30/2019 0:00:00,49,4049XXRP,0.00,,,-23.99
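To illustrate reading a file from a subfolder, the sketch below writes a tiny .csv into a temporary `datafolder` directory and reads it back. The temporary directory and the three-column file are assumptions made so the example runs on its own. The two data rows reuse values from the output above.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create datafolder/Data Source.csv inside a throwaway directory
    folder = Path(tmp) / 'datafolder'
    folder.mkdir()
    csv_path = folder / 'Data Source.csv'
    csv_path.write_text('Store,CategoryID,Q1\n1,1,24924.50\n1,2,50605.27\n')

    # read_csv accepts a path string or a Path object
    loaded = pd.read_csv(csv_path)

print(loaded)
print(loaded.shape)
```
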


We will continue to work with .csv data in Python in the next section of this chapter. However, much like what we see in a real-world setting, data is not always perfect. We can see that we have errors in our data, such as NaN values and duplicates.