# Loading Data
Now that we are more familiar with the NumPy and Pandas packages, we can start loading data into Python to work with. There are several ways to get data into our notebook. The four we'll look at here are:
1. Generating data within our notebook using `np.arange()`
2. Generating data within our notebook using `np.random` functions
3. Importing external data into our notebook using Pandas DataReader
4. Importing an external .csv file into our notebook

## Generating Data with NumPy arange

In the last chapter, we demonstrated how to create NumPy arrays by providing every value in the array. NumPy also lets us generate a range of values more efficiently. Let's first look at the `arange()` function, which creates a range of evenly spaced values. The value 30 that we put inside the parentheses is the end of the range. This means that `array3` will start at 0 and end before 30.

In [35]:
import numpy as np
import pandas as pd
array3 = np.arange(30)
array3

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

We can also provide a starting value, in this case 4, and an ending value, in this case 8. This array starts at 4 and ends before 8.

In [4]:
np.arange(4,8)

array([4, 5, 6, 7])

We can also add a step. For example, let's create an array starting at 1 and ending before 10, where a step of 2 means the range takes every other number. Again, the range goes up to but does not include the ending value.

In [5]:
# arange(starting value, ending value, step)
np.arange(1,10,2)

array([1, 3, 5, 7, 9])

The last two examples are arrays consisting of a range of integers. We can also generate a range of float values within an interval. For this purpose we'll use the `linspace()` function. This function takes a starting value, an ending value, and the number of items in the array. Unlike `arange()`, it includes the ending value by default. This example returns 40 values evenly spaced between 0 and 8.

In [6]:
# Generate 40 evenly spaced float values from 0 to 8 (both ends included)
np.linspace(0,8,40)

array([0.        , 0.20512821, 0.41025641, 0.61538462, 0.82051282,
       1.02564103, 1.23076923, 1.43589744, 1.64102564, 1.84615385,
       2.05128205, 2.25641026, 2.46153846, 2.66666667, 2.87179487,
       3.07692308, 3.28205128, 3.48717949, 3.69230769, 3.8974359 ,
       4.1025641 , 4.30769231, 4.51282051, 4.71794872, 4.92307692,
       5.12820513, 5.33333333, 5.53846154, 5.74358974, 5.94871795,
       6.15384615, 6.35897436, 6.56410256, 6.76923077, 6.97435897,
       7.17948718, 7.38461538, 7.58974359, 7.79487179, 8.        ])
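
To make the spacing in the output above easier to see, here is a minimal sketch that checks the count, the last value, and the gap between neighbouring values. The variable names `values` and `spacing` are illustrative only.

```python
import numpy as np

# linspace takes (start, stop, number of values) and includes the stop value
values = np.linspace(0, 8, 40)

# The gap between neighbours is (stop - start) / (number of values - 1)
spacing = values[1] - values[0]

print(len(values))   # number of items in the array
print(values[-1])    # the last value is the stop value itself
print(spacing)       # the same gap separates every pair of neighbours
```

Because the stop value is included, 40 values produce 39 gaps, which is why the spacing is 8/39 rather than 8/40.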

A range has similar properties to lists and other NumPy arrays when it comes to indexing. Let's use `array3`, which we created earlier, as an example. To get the 5th value in the array, we put index 4 in the square brackets.

In [8]:
array3[4]


np.int64(4)
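
The same indexing rules extend to negative indexes and slices. The short sketch below recreates `array3` so it runs on its own; the names `fifth`, `last`, and `first_three` are illustrative only.

```python
import numpy as np

array3 = np.arange(30)

fifth = array3[4]          # 5th value, because indexing starts at 0
last = array3[-1]          # negative indexes count back from the end
first_three = array3[:3]   # a slice returns indexes 0, 1 and 2

print(fifth, last, first_three)
```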

## Generating Data with NumPy random

Many functions allow us to generate randomized values in a NumPy array. Let's start with the `random.rand` function. It generates random floats from 0 up to, but not including, 1. In the parentheses we put how many numbers we want to generate. In this example we generate 10 random floats.

In [10]:
# creating an array of random floats
np.random.rand(10)

array([0.05999653, 0.76248804, 0.95188932, 0.49984858, 0.08604553,
       0.59724195, 0.87281301, 0.12233287, 0.91468835, 0.25517815])

Another function we can use is `random.randint`. It returns a random integer within a range. For example, we can get a random whole number between 10 and 20. It is important to note that this function includes 10 but excludes 20, so the possible values run from 10 up to 19.

In [13]:
# generating a random integer within a range
np.random.randint(10,20)

11

If we add a third number, we can generate that many random integers within the range. This example creates an array of 10 integers from 10 up to 19. The random functions may return repeated values.

In [14]:
# generating an array of random integers within a range
np.random.randint(10,20,10)

array([10, 13, 15, 17, 15, 15, 14, 17, 12, 14], dtype=int32)

We can also create a random matrix by providing a shape. For example, let's create a matrix of integers from 1 up to 999 with 10 rows and 10 columns.

In [23]:
np.random.randint(1,1000,(10,10))

array([[298, 665, 359, 637, 699, 724,  42, 461, 245, 136],
       [528, 227, 365, 906,  53, 193, 625, 352, 618, 577],
       [506, 930, 243, 533, 835, 119, 895, 337, 137, 178],
       [125, 661, 781, 453, 940, 666, 529, 944, 919,  40],
       [ 53, 102, 306, 882, 343,   6, 614, 823, 130,   1],
       [ 53, 282, 629, 778, 795,  90, 558, 913, 266, 860],
       [685, 838, 804, 712,  35, 599,  78, 445, 909, 701],
       [386, 942, 134, 255, 223, 175, 873, 439, 137, 970],
       [927, 989, 326, 547, 974, 450,  36, 809, 196, 464],
       [139,  56, 872, 727, 115, 605, 438, 172, 166, 524]], dtype=int32)
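
As a smaller sketch of the same idea, the shape tuple is always (rows, columns), and the upper bound is never returned. The 2-by-4 shape and the 1 to 100 range here are illustrative choices, and `matrix` is an illustrative name.

```python
import numpy as np

# Shape is given as (rows, columns); the upper bound 100 is excluded
matrix = np.random.randint(1, 100, (2, 4))

print(matrix.shape)                           # (rows, columns)
print(matrix.min() >= 1, matrix.max() < 100)  # every value stays inside the range
```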

Since all the numbers are randomly generated, if we run these functions again we're likely to get completely different values. If we want to generate the same random values every time we run a model, we can use the seed function.

The seed function sets the starting point of the random number generator, so the same seed produces the same sequence of "random" values on each run. The values still look random, but they are repeatable. We do this to ensure reproducibility. In production, where fresh random values are often wanted (for example in forecasting simulations), you could leave the seed out.

In [21]:
# fixing the seed to get the same random values on each run, for reproducibility
np.random.seed(3)
np.random.randint(1,1000,(10,10))

array([[875, 665, 250, 644, 953, 969, 257, 790, 660, 715],
       [876, 682, 651, 918, 120, 423, 481, 661, 941, 606],
       [424, 399, 505, 539, 338, 475, 663, 835, 259, 960],
       [317, 642, 948, 859, 582, 873, 623, 610, 414, 619],
       [153, 508, 831, 520, 172, 546, 380, 336, 177, 806],
       [ 21, 863, 562, 406, 384, 719, 501, 541, 888, 567],
       [  1, 705, 275, 373, 960, 678, 185, 953, 968, 166],
       [766, 687, 546, 114,   2, 854, 118, 888, 971, 748],
       [893, 997, 612, 732, 913, 512, 849, 673, 913, 275],
       [844, 184, 865, 224, 142, 422, 415, 561, 958, 290]], dtype=int32)

If we remove the seed line, we'll get different values every time we run the code.

In [None]:
# without the seed


If we run the same code again, adding back the seed, we get the same result.
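
One way to check this behaviour in a single cell is to set the seed, generate a matrix, then set the same seed again and generate a second matrix. This is a minimal sketch; `first_run` and `second_run` are illustrative names.

```python
import numpy as np

np.random.seed(3)
first_run = np.random.randint(1, 1000, (10, 10))

# Setting the same seed again restarts the generator from the same point
np.random.seed(3)
second_run = np.random.randint(1, 1000, (10, 10))

# True when every element of the two matrices matches
print(np.array_equal(first_run, second_run))
```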

## Importing External Data with Pandas DataReader

Rather than defining our data manually, we can leverage functionality within Pandas to connect to and import external data into our notebook.

Functions from `pandas_datareader.data` pull external data from the web into a Pandas DataFrame. We'll look at an example of using DataReader to import financial data from Naver.

In [36]:
# Import packages
import pandas as pd
import pandas_datareader.data as pdr

# The first argument is the specific stock or index we want to retrieve data for
# 005930 corresponds to Samsung on the KOSPI index

# The second argument is the data source we want to connect to - in our example it's naver

# We also need to define the start and end dates for the days of information that we want to import into our DataFrame
#pdr.DataReader('005930','naver','2022-01-01','2022-01-31')

We can also connect to Yahoo Finance data for a particular stock, in this case Tesla (ticker `tsla`). This time we define our arguments outside of the function, which makes them easier to find and change in the future. We can also store the resulting DataFrame for future use in our notebook. Note that the `yahoo` source in `pandas_datareader` may not work with every installed version; the `yfinance` package is a common alternative for Yahoo Finance data.

In [30]:
# pdr must be imported before DataReader is called, otherwise Python raises a NameError
import pandas_datareader.data as pdr

stock = 'tsla'
source = 'yahoo'
startdate = '2021-01-01'
enddate = '2024-12-31'

stocks_df = pdr.DataReader(stock, source, startdate, enddate)
print(stocks_df)

## Importing External .csv

One of the more powerful ways of creating a DataFrame is to import a .csv file. This saves a tremendous amount of time compared with recreating a data source in Python by hand. We can use the `pd.read_csv()` function to do this.

When using this function, either the relative file path or the complete file path must be passed in quotation marks. For a relative file path, if the .csv file is in the same working directory as the notebook, you can simply pass the file name with the .csv suffix.

If the file is in another folder, you must provide the path pandas should follow to reach the file (e.g. 'datafolder/Data Source.csv').
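
The sketch below shows the folder-plus-file-name pattern end to end. So that it runs anywhere, it first writes a tiny made-up .csv into a temporary folder; the folder layout, the two rows of data, and the names `csv_path` and `sample_df` are all assumptions for illustration, not the course data.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical layout: <workdir>/datafolder/Data Source.csv
    folder = Path(workdir) / "datafolder"
    folder.mkdir()
    csv_path = folder / "Data Source.csv"
    csv_path.write_text("Store,Q1\n1,24924.50\n2,50605.27\n")

    # read_csv accepts either a path string or a Path object
    sample_df = pd.read_csv(csv_path)

print(sample_df)
```

Building the path with `pathlib` avoids hand-typing separators, which differ between Windows and other operating systems.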

In [42]:
# Importing an external .csv using a complete (absolute) file path
datafram3 = pd.read_csv('C:\\Users\\WY343EH\\Downloads\\Projects\\Python\\Python Fundamentals - Learner Files\\C2 - Loading & Cleaning Data\\1 - Start\\Data Source.csv')

# Display the imported DataFrame
print(datafram3)


      Store                Date  CategoryID ProductID        Q1         Q2  \
0         1   3/10/2019 0:00:00           1  11XXRP-P  24924.50   46039.49   
1         1    6/4/2019 0:00:00           2  12XXRP-Q  50605.27   44682.74   
2         1   1/16/2019 0:00:00           3    13XXRP  13740.12   10887.84   
3         1  10/18/2019 0:00:00           4    14XXRP  39954.04  d35351.21   
4         1   4/11/2019 0:00:00           5  15XXRP-A  32229.38   29620.81   
...     ...                 ...         ...       ...       ...        ...   
1882     40   3/24/2019 0:00:00          44  4044XXRP   3280.00    3556.22   
1883     40   3/24/2019 0:00:00          44  4044XXRP   3280.00    3556.22   
1884     40   3/24/2019 0:00:00          44  4044XXRP   3280.00    3556.22   
1885     40   9/30/2019 0:00:00          49  4049XXRP      0.00        NaN   
1886     40   6/14/2019 0:00:00          43  4093XXRP  31080.68   26600.39   

            Q3        Q4  
0     41595.55       NaN  
1     479

We will continue to work with .csv data in Python in the next section of this chapter. However, much like what we see in a real-world setting, data is not always perfect. We can see that we have errors in our data, such as NaN values and duplicates.
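
As a quick preview of how such problems can be counted, here is a minimal sketch on a made-up four-row table that mimics the kinds of issues visible above (one missing value and one repeated row). The data and the names `csv_text`, `sample_df`, `missing_per_column`, and `duplicate_rows` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical miniature data with one empty Q2 cell and one repeated row
csv_text = """Store,ProductID,Q1,Q2
1,11XXRP-P,24924.50,46039.49
40,4044XXRP,3280.00,3556.22
40,4044XXRP,3280.00,3556.22
40,4049XXRP,0.00,
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Count missing (NaN) values in each column
missing_per_column = sample_df.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_rows = sample_df.duplicated().sum()

print(missing_per_column)
print(duplicate_rows)
```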