## NumPy: Arrays and Vectors

The NumPy package provides multi-dimensional arrays, a range of mathematical operations on them, and linear algebra routines.

In [1]:
import numpy as np

Arrays can be created with the 'array' function.

In [2]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

In [3]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [4]:
arr2.ndim

2

In [5]:
arr2.shape

(2, 4)

In [6]:
np.zeros((2, 3))
#np.ones((3,3))
#np.empty((2, 3, 2))
#np.arange(10)

array([[0., 0., 0.],
       [0., 0., 0.]])
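
The commented-out alternatives in the cell above can be tried in the same way. The following is a minimal sketch; the variable names `ones`, `empty` and `seq` are chosen here for illustration only.

```python
import numpy as np

# Array of ones with shape (3, 3); the default dtype is float64.
ones = np.ones((3, 3))

# np.empty allocates memory without initialising it,
# so the values it contains are arbitrary.
empty = np.empty((2, 3, 2))

# np.arange is the array counterpart of Python's built-in range.
seq = np.arange(10)

print(ones.shape, ones.dtype)
print(empty.shape)
print(seq)
```

Because `np.empty` does not initialise its contents, it should only be used when every element is going to be overwritten afterwards.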

Various arithmetic operations are possible.
(Operations between equal-sized arrays are elementwise.)

In [7]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [8]:
#arr * arr
#arr - arr
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])
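
The elementwise behaviour also applies to the commented-out operations in the cell above. A short sketch, with the result names (`squared`, `zeros`, `halves`) introduced here for illustration:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

squared = arr * arr   # elementwise product, not a matrix product
zeros = arr - arr     # elementwise difference
halves = arr / 2      # a scalar is applied to every element

print(squared)
print(zeros)
print(halves)
```

For a true matrix product, NumPy offers the `@` operator (or `np.dot`), which requires compatible rather than equal shapes.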

There are multiple ways to select individual elements or subsets of an array.

In [9]:
arr0 = np.arange(15)
arr0

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Remember that indexing starts at zero, counting from the left.

Note that slices are views that reference the original values, so writing to a slice modifies the original array!

In [10]:
arr_slice = arr0[0:3]  # named arr_slice because `slice` would shadow the Python built-in
arr_slice[1] = 12345
arr0

array([    0, 12345,     2,     3,     4,     5,     6,     7,     8,
           9,    10,    11,    12,    13,    14])

In [11]:
sliceC = arr0[0:3].copy()
sliceC[1] = 6789
arr0

array([    0, 12345,     2,     3,     4,     5,     6,     7,     8,
           9,    10,    11,    12,    13,    14])

In [12]:
sliceD = arr0[arr0 < 5]
sliceD

array([0, 2, 3, 4])
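
Whether a selection is a view or a copy can be checked directly. The sketch below uses `np.shares_memory` on a fresh array; the names `view`, `copied` and `masked` are illustrative.

```python
import numpy as np

arr0 = np.arange(15)

view = arr0[0:3]            # basic slice: a view on arr0's data
copied = arr0[0:3].copy()   # explicit copy: independent data
masked = arr0[arr0 < 5]     # boolean indexing: returns a copy

print(np.shares_memory(arr0, view))    # True
print(np.shares_memory(arr0, copied))  # False
print(np.shares_memory(arr0, masked))  # False

masked[0] = -1   # modifies only the copy
print(arr0[0])   # still 0
```

So, unlike a basic slice, a boolean selection can be modified without affecting the original array.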

## Pandas

Pandas provides data structures and manipulation tools for cleaning and analysing tabular and heterogeneous data. Its two main structures are Series and DataFrame.

In [21]:
import pandas as pd

<b>Series</b> is a one-dimensional array of values with associated labels, called the 'index'.
(A series can also be created from a dictionary, in which case the keys become the index.)

In [15]:
ts = pd.Series([4, 7, -5, 3, 2, 2], index=["2018", "2019", "2020", "2021","2022", "2023"])
ts

2018    4
2019    7
2020   -5
2021    3
2022    2
2023    2
dtype: int64

In [16]:
print("Indices of ts:",ts.index)
print("Values of ts",ts.values)
print("Values of ts in 2023 is:",ts["2023"])

Indices of ts: Index(['2018', '2019', '2020', '2021', '2022', '2023'], dtype='object')
Values of ts [ 4  7 -5  3  2  2]
Values of ts in 2023 is: 2
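
Creating a series from a dictionary, as mentioned above, might look like the following sketch. The dictionary `values_by_year` is a hypothetical example that reuses a few of the numbers from `ts`.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels.
values_by_year = {"2018": 4, "2019": 7, "2020": -5}
ts_from_dict = pd.Series(values_by_year)

print(ts_from_dict)
print(ts_from_dict["2019"])
```

The index follows the insertion order of the dictionary keys.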


In [17]:
ts1 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
ts1

d    4
b    7
a   -5
c    3
dtype: int64

In [18]:
ts0 = [float("NaN"), float("nan"), None]
ts2 = pd.Series(ts0)
print(ts2)
pd.isnull(ts2)

0   NaN
1   NaN
2   NaN
dtype: float64


0    True
1    True
2    True
dtype: bool

## Dataframes

<b>DataFrame</b> represents a rectangular table of data with both a row index and a column index; each column can hold a different type.


One way of creating a dataframe manually is from a dictionary of equal-length lists.


In [22]:
data = {"state": ["Berlin", "Berlin", "Berlin", "Hamburg", "Hamburg", "Hamburg"],
        "year": [2020, 2021, 2022, 2021, 2022, 2023],
        "pop": [3.5, 3.7, 3.6, 1.4, 1.6, 1.6]}
df = pd.DataFrame(data)
df.head()

     state  year  pop
0   Berlin  2020  3.5
1   Berlin  2021  3.7
2   Berlin  2022  3.6
3  Hamburg  2021  1.4
4  Hamburg  2022  1.6


In [23]:
#df["year"]
df.year
df.loc[1]

state    Berlin
year       2021
pop         3.7
Name: 1, dtype: object
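
Only the last expression of a cell is displayed, so the column selections in the cell above produce no visible output. The sketch below separates the access styles on a shortened, illustrative version of the table; the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({"state": ["Berlin", "Berlin", "Hamburg"],
                   "year": [2020, 2021, 2021],
                   "pop": [3.5, 3.7, 1.4]})

years = df["year"]         # bracket access works for any column name
same_years = df.year       # attribute access needs a valid Python identifier
row_by_label = df.loc[1]   # row whose index label is 1
row_by_pos = df.iloc[1]    # row at integer position 1

# "pop" collides with the DataFrame.pop method, so df.pop is the method,
# not the column; brackets are needed here.
population = df["pop"]

print(years.equals(same_years))
print(row_by_label["state"])
print(population.tolist())
```

With the default integer index, `loc` and `iloc` return the same row; they differ once the index labels are no longer 0, 1, 2, and so on.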

If a nested dictionary is passed to the dataframe constructor, the outer keys are interpreted as the columns and the inner keys as the row index.

In [24]:
data2 = {"Hamburg": {2000: 1.5, 2001: 1.7, 2002: 3.6},
         "Berlin": {2001: 3.4, 2002: 3.9}}
df2 = pd.DataFrame(data2)
df2
df2.index.name = "year"
df2.columns.name = "state"
df2
# df.T
# pd.DataFrame(data2, index=[2001, 2002, 2003])

state  Hamburg  Berlin
year
2000       1.5     NaN
2001       1.7     3.4
2002       3.6     3.9
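
The two commented-out lines in the cell above show further options: transposing the table and passing an explicit index. A minimal sketch, with `transposed` and `df_sub` as illustrative names:

```python
import pandas as pd

data2 = {"Hamburg": {2000: 1.5, 2001: 1.7, 2002: 3.6},
         "Berlin": {2001: 3.4, 2002: 3.9}}

df2 = pd.DataFrame(data2)

# Transposing swaps rows and columns: states become rows, years columns.
transposed = df2.T

# With an explicit index, only the listed labels are kept;
# 2003 appears in neither inner dictionary, so its row is all NaN.
df_sub = pd.DataFrame(data2, index=[2001, 2002, 2003])

print(transposed)
print(df_sub)
```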


In [25]:
df3 = df2.drop('Berlin',axis=1)
df4 = df3.drop(2001)
df4

state  Hamburg
year
2000       1.5
2002       3.6


Adding two dataframes returns a dataframe whose index and columns are the unions of the original ones. Cells whose row or column label is not present in both original dataframes are filled with missing values (NaN).

In [26]:
dfA = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
dfB = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
print(dfA)
print(dfB)
dfC1 = dfA + dfB
print(dfC1)
dfC2 = dfA.add(dfB, fill_value=0)
print(dfC2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0
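
Some NaN values remain even with `fill_value=0`, because the fill only applies when exactly one of the two frames has a value for a cell. The sketch below isolates this with two tiny frames made up for illustration:

```python
import pandas as pd

# Illustrative frames that share neither a row nor a column label.
left = pd.DataFrame({"b": [1.0]}, index=["Ohio"])
right = pd.DataFrame({"e": [2.0]}, index=["Utah"])

total = left.add(right, fill_value=0)
print(total)

# (Ohio, b) exists only in `left`:  1.0 + 0 -> 1.0
# (Utah, e) exists only in `right`: 0 + 2.0 -> 2.0
# (Ohio, e) and (Utah, b) exist in neither frame, so they stay NaN.
```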


Pandas objects also provide a range of common statistical and mathematical summary methods.

In [27]:
df5 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["A", "B"])
df5

      A    B
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [28]:
#df5.sum()
df5.mean(axis='rows', skipna=True)

A    3.083333
B   -2.900000
dtype: float64
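
The effect of the `axis` and `skipna` arguments can be seen by varying them on the same data. A small sketch, with the result names chosen for illustration:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                    [np.nan, np.nan], [0.75, -1.3]],
                   index=["a", "b", "c", "d"],
                   columns=["A", "B"])

col_sums = df5.sum()                   # one value per column, NaN skipped
row_means = df5.mean(axis="columns")   # one value per row, across the columns
strict = df5.mean(skipna=False)        # any NaN in a column makes the result NaN

print(col_sums)
print(row_means)
print(strict)
```

Row `c` contains only missing values, so its row mean is NaN even though NaN values are skipped by default.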

In [29]:
# conda install pandas-datareader
# import pandas_datareader.data as pdr
# !pip install yfinance
import yfinance as yf

In [30]:
DataAll = {ticker: yf.download(ticker, start='2022-01-01', end='2022-02-28') for ticker in ['AAPL', 'MSFT', 'GOOG']}

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


In [169]:
DataAll


{'AAPL':                  Open       High        Low      Close  Adj Close     Volume
 Date                                                                        
 2020-01-02  74.059998  75.150002  73.797501  75.087502  73.449387  135480400
 2020-01-03  74.287498  75.144997  74.125000  74.357498  72.735306  146322800
 2020-01-06  73.447502  74.989998  73.187500  74.949997  73.314888  118387200
 2020-01-07  74.959999  75.224998  74.370003  74.597504  72.970085  108872000
 2020-01-08  74.290001  76.110001  74.290001  75.797501  74.143906  132079200
 2020-01-09  76.809998  77.607498  76.550003  77.407501  75.718781  170108400
 2020-01-10  77.650002  78.167503  77.062500  77.582497  75.889969  140644800
 2020-01-13  77.910004  79.267502  77.787498  79.239998  77.511307  121532000
 2020-01-14  79.175003  79.392502  78.042503  78.169998  76.464638  161954400
 2020-01-15  77.962502  78.875000  77.387497  77.834999  76.136948  121923600
 2020-01-16  78.397499  78.925003  78.022499  78.809998 