#### <b>Pandas</b>
<font size=2>
Pandas contains high-level data structures and manipulation tools designed to make data
analysis fast and easy in Python. It is built on top of NumPy, which makes it easy to
use in NumPy-centric applications.

Pandas is widely used in data science, machine learning, and data analysis due to its ability to handle large datasets and perform complex data operations efficiently.
</font>

In [106]:
import pandas as pd
import numpy as np      # NumPy is used in several later examples

<font size=3>
The two main data structures in Pandas
</font>
<font size=2>

- <b>DataFrame:</b> A two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table.
- <b>Series:</b> A one-dimensional, labeled array that can hold any data type. It is like a single column in a DataFrame.

Let's get started with Series.

</font>
<br>
<b>Series</b>
<font size=2>The simplest Series is formed from only an array of data:</font>

In [107]:
s1 = pd.Series([4, 7, -5, 3])
s1

0    4
1    7
2   -5
3    3
dtype: int64

<font size=2>
The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its values
and index attributes, respectively:
</font>

In [108]:
print(s1.values)
s1.index

[ 4  7 -5  3]


RangeIndex(start=0, stop=4, step=1)

<font size=2>Since the index attribute above returns a RangeIndex object, we can loop over it:</font>

In [109]:
for i in s1.index:
    print(i, end=" ")

0 1 2 3 

We can assign our own index so that each element is identified by a label:

In [110]:
s2 = pd.Series([4, 7, -5, 3], index=['a','b','c','d'])
s2

a    4
b    7
c   -5
d    3
dtype: int64

<font size=2>
Unlike with a regular NumPy array, you can use labels in the index when selecting
single values or a set of values:
</font>

In [111]:
print("Accessing element: ",s2['a'],"\n") # accessing the element from series using index
s2['d'] = 6     # assigning to an element using its index label
print("Accessing elements using multiple indexes:")
s2[['c', 'a', 'd']] # accessing multiple elements using a
                    # list of index labels

Accessing element:  4 

Accessing elements using multiple indexes:


c   -5
a    4
d    6
dtype: int64

<font size=2>
NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

</font>

In [112]:
s2[s2 > 0]

a    4
b    7
d    6
dtype: int64

In [113]:
s2 * 2

a     8
b    14
c   -10
d    12
dtype: int64

In [114]:
np.exp(s2)

a      54.598150
b    1096.633158
c       0.006738
d     403.428793
dtype: float64

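<font size=2>A Series can also be thought of as a fixed-length, ordered dict, since it maps index values to data values. It can be substituted into many functions that expect a
dict:</font>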

In [115]:
print('b' in s2)
print('e' in s2)

True
False
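The dict-like behaviour described above can be sketched a little further. The example below is only an illustration on a small Series built for this purpose (the variable names are not from the notebook); it shows that membership tests look at the index labels rather than the values, and that a few dict-style helpers are available.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 6], index=['a', 'b', 'c', 'd'])

# Membership tests check the index labels, not the stored values
has_b = 'b' in s          # 'b' is a label
has_7 = 7 in s            # 7 is a value, not a label

# dict-style helpers
value_or_default = s.get('e', 0)   # label 'e' is missing, so the default is returned
labels = list(s.keys())            # keys() returns the index
as_dict = s.to_dict()              # convert to a plain Python dict

print(has_b, has_7, value_or_default)
print(labels)
print(as_dict)
```

To test whether a value (rather than a label) is present, `s.isin([...])` or `7 in s.values` are the usual alternatives.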


<font size=2>You can also create a Series from a dictionary:</font>

In [116]:
sdata = {'Ohio': 35000,
         'Texas': 71000,
         'Oregon': 16000,
         'Utah': 5000}
s3 = pd.Series(sdata)
s3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

<font size=2>What happens if some values are missing from the data?</font>

In [117]:
data = {
    'A': 10,
    'B': None,
    'C': 30,
    'D': None,
    'E': 50
}
null_data = pd.Series(data)
print(null_data)

A    10.0
B     NaN
C    30.0
D     NaN
E    50.0
dtype: float64


<font size=2>Above we can see that some values are missing: they appear as <b>None</b> in the dict, but pandas displays them as <b>NaN</b> in the Series.

To check whether a Series contains null values, use the isnull() method:
</font>

In [118]:
null_data.isnull()

A    False
B     True
C    False
D     True
E    False
dtype: bool

<font size=2>The <b>sum()</b> method can be chained after <b>isnull()</b> to count the null values:</font>

In [119]:
print("Null values in null_data: ",null_data.isnull().sum())

Null values in null_data:  2


<font size=2>Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:</font>

In [120]:
null_data.name = "rank"
null_data.index.name = "grade"

null_data

grade
A    10.0
B     NaN
C    30.0
D     NaN
E    50.0
Name: rank, dtype: float64
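One place where these names show up is when a Series is converted to a DataFrame. The sketch below rebuilds a shortened version of the Series above (an assumption made to keep the example self-contained) and shows where the two names end up.

```python
import pandas as pd

ranks = pd.Series({'A': 10, 'B': None, 'C': 30}, name='rank')
ranks.index.name = 'grade'

as_frame = ranks.to_frame()    # the Series name becomes the column label
flat = ranks.reset_index()     # the index name becomes an ordinary column

print(as_frame.columns.tolist())
print(flat.columns.tolist())
```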

<b>DataFrame</b>
<font size=2>

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.).
 
The DataFrame has both a row and column index; it can be
thought of as a dict of Series, all sharing the same index.
</font>

Creation
<font size=2>

There are numerous ways to construct a DataFrame, though one of the most common
is from a dict of equal-length lists or NumPy arrays</font>


In [121]:
df = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(df)

frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


<font size=2>
If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:
</font>

In [122]:
pd.DataFrame(df, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


<font size=2>
As with Series, if you pass a column that isn’t contained in data, it will appear with NA
values in the result:
</font>

In [123]:
frame2 = pd.DataFrame(df, 
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


<font size=2>
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute
</font>

In [124]:
display(frame2['year'])
display(frame2.state)

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

<font size=2>Rows can also be accessed by label or by position using the <b>.loc</b> and <b>.iloc</b> indexers. Both are used with square brackets rather than called like methods.

- .loc selects by the labels of the index

- .iloc selects by integer position (0, 1, 2, ...)
</font>

In [125]:
frame2.iloc[0]

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object

In [126]:
frame2.loc['four']

year       2001
state    Nevada
pop         2.4
debt        NaN
Name: four, dtype: object
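The difference between the two indexers is easiest to see with slices. This is a small sketch on a frame shaped like frame2 (rebuilt here so the block runs on its own): label slices with .loc include both endpoints, while position slices with .iloc exclude the end, as in plain Python.

```python
import pandas as pd

frame = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001, 2002],
     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
    index=['one', 'two', 'three', 'four', 'five'])

by_label = frame.loc['two':'four']     # label slice: both endpoints included
by_position = frame.iloc[1:3]          # position slice: end excluded
cell = frame.loc['four', 'pop']        # single value by row and column label
same_cell = frame.iloc[3, 2]           # the same value by integer position

print(by_label.index.tolist())
print(by_position.index.tolist())
print(cell, same_cell)
```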

<font size=2>Columns can be modified by assignment. A scalar value is broadcast to every row:</font>

In [127]:
frame2.debt = 10
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,10
two,2001,Ohio,1.7,10
three,2002,Ohio,3.6,10
four,2001,Nevada,2.4,10
five,2002,Nevada,2.9,10


<font size=2>When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will be instead conformed exactly to the
DataFrame’s index, inserting missing values in any holes:</font>

In [128]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


<font size=2>Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:</font>

In [129]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2


Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [130]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
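A caveat worth sketching: attribute-style access is convenient for reading an existing column, but it does not create new columns, and it fails for names that collide with DataFrame methods (such as pop). The frame and the column name western below are made up for illustration.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')            # pandas warns about this pattern
    frame.western = frame.state == 'Nevada'    # sets a Python attribute, not a column

frame['eastern'] = frame.state == 'Ohio'       # bracket syntax creates a real column

# 'pop' is also a DataFrame method, so frame.pop is the method, not the column
pop_is_method = callable(frame.pop)
pop_column = frame['pop']

# drop returns a new frame and leaves the original intact, unlike del
dropped = frame.drop(columns='eastern')

print(frame.columns.tolist())
print(dropped.columns.tolist())
```

For this reason the bracket form is the safer default whenever a column is created or its name is not a plain identifier.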

<font size=2><b>Another common form of data is a nested dict of dicts format:</b></font>

In [131]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

<font size=2>If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
keys as the row indices:</font>

In [132]:
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5
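Two follow-ups are often useful with nested dicts; the sketch below assumes the same pop dict as above. Transposing swaps rows and columns, and passing an explicit index controls the row order instead of relying on the order in which the inner keys were encountered.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
transposed = frame3.T                                  # states become the rows
ordered = pd.DataFrame(pop, index=[2000, 2001, 2002])  # explicit row order

print(transposed)
print(ordered)
```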


<font size=2>Like Series, the values attribute returns the data contained in the DataFrame as a 2D
ndarray:</font>

In [133]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [134]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

<b>Index Objects</b>

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [135]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [136]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [137]:
index[1:]

Index(['b', 'c'], dtype='object')

<font size=2>Index objects are immutable and thus can’t be modified by the user.

In addition to being array-like, an Index also functions as a fixed-size set:
</font>

In [138]:
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [139]:
print('Ohio' in frame3.columns)
print(2003 in frame3.index)

True
False
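Both properties of an Index mentioned above, immutability and set-like behaviour, can be checked directly. This is a minimal sketch on a throwaway Index, not part of the original notebook.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Item assignment is rejected because an Index is immutable
try:
    idx[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# Set-style operations return new Index objects
combined = idx.union(pd.Index(['c', 'd']))
common = idx.intersection(pd.Index(['c', 'd']))

print(mutated)
print(combined.tolist(), common.tolist())
```

Immutability is what makes it safe to share one Index object among several Series and DataFrames.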


<b>Reindexing</b>

<font size=2>Calling reindex on a Series rearranges the data according to the new index, introducing missing values if any index values were not already present:</font>


In [140]:
rdx = pd.Series([4.5, 7.2, -5.3, 3.6],
              index=['d', 'b', 'a', 'c'])
rdx

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [141]:
rdx2 = rdx.reindex(['a','b','c','d','e'])
rdx2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [142]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a    0
b    1
c    2
d    0
e    0
dtype: int64

<font size=2>For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method such
as ffill which forward fills the values:
</font>

In [143]:
rdx3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
# rdx3
rdx3.reindex(range(6),method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
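For comparison, the same Series can be filled backward instead. The sketch below uses method='bfill', which carries the next valid value backward; note that a label after the last known value has nothing to borrow from and stays missing. The method option is generally expected to be used with a monotonically ordered index, as here.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = colors.reindex(range(6), method='ffill')   # carry the last value forward
backward = colors.reindex(range(6), method='bfill')  # carry the next value backward

print(forward.tolist())
print(backward.tolist())
```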

<font size=2>With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed just a sequence, the rows are reindexed in the result:
</font>

In [144]:
df2 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                index=['a', 'c', 'd'],
                columns=['Ohio', 'Texas', 'California'])
df2

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [145]:
df2 = df2.reindex(['a', 'b', 'c', 'd'])
df2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


<font size=2>The columns can be reindexed using the columns keyword:</font>

In [146]:
states = ['Texas', 'Utah', 'California']
df2.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


<b>Drop data</b>

<font size=2>The drop method returns a new object with the indicated entries removed from an axis:</font>

In [147]:
data_drop = pd.Series(np.arange(5.),
                      index=['a', 'b', 'c', 'd', 'e'])
data_drop

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [148]:
new_obj = data_drop.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [149]:
data_drop.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis:

In [150]:
state_data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                index=['Ohio', 'Colorado', 'Utah', 'New York'],
                columns=['one', 'two', 'three', 'four'])
state_data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [151]:
# dropping rows
state_data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [152]:
# dropping column data
display(state_data.drop('two', axis=1))
display(state_data.drop(['two','four'], axis=1))

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
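As an alternative to axis=1, drop also accepts index= and columns= keywords, which some readers find easier to follow. The sketch below rebuilds state_data so it runs on its own and also confirms that drop leaves the original frame untouched.

```python
import numpy as np
import pandas as pd

state_data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['Ohio', 'Colorado', 'Utah', 'New York'],
                          columns=['one', 'two', 'three', 'four'])

with_axis = state_data.drop('two', axis=1)        # positional label plus axis
with_keyword = state_data.drop(columns=['two'])   # same result, keyword form
without_ohio = state_data.drop(index='Ohio')      # keyword form for rows

print(with_keyword.columns.tolist())
print(without_ohio.index.tolist())
print(state_data.shape)   # the original is unchanged
```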


In [153]:
arr = np.array([[1,2,3,4],
                [5,6,7,8]])
print(arr.sum())
print(arr.sum(axis=1))

36
[10 26]


<b>Indexing, Selection and Filtering</b>

<font size=2>Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.</font>

In [154]:
obj2 = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj2['b'])
print(obj2.iloc[1])   # positional access; plain obj2[1] is deprecated for label-based indexes
print(obj2[2:4])

1.0
1.0
c    2.0
d    3.0
dtype: float64


In [155]:
print(obj2[obj2 < 2])
# Slicing with labels behaves differently than normal Python slicing in that the endpoint
# is inclusive:
obj['b':'c']

a    0.0
b    1.0
dtype: float64


b    1
c    2
dtype: int64

In [156]:
# setting a range of values to a particular value
obj2['b':'c'] = 6
obj2

a    0.0
b    6.0
c    6.0
d    3.0
dtype: float64

<font size=2>For DataFrame label-indexing on the rows, the special indexing attribute loc is used (the older ix indexer has been removed from pandas). It enables you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels.</font>

In [157]:
state_data2 =  pd.DataFrame(np.arange(16).reshape((4, 4)),
                index=['Ohio', 'Colorado', 'Utah', 'New York'],
                columns=['one', 'two', 'three', 'four'])
state_data2

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [158]:
state_data2.loc['Ohio',['two','three']]

two      1
three    2
Name: Ohio, dtype: int32

<h2>Note:</h2>

<b>Use .loc[] and .iloc[] for accessing the rows.

Columns can be accessed using column names in square brackets.</b>
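The two indexers also accept a row selector and a column selector at the same time. The sketch below, on a frame like state_data2, combines a boolean row filter with a list of column labels in .loc and uses pure positions in .iloc.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Rows where column 'three' exceeds 5, restricted to two columns
subset = data.loc[data['three'] > 5, ['one', 'four']]

# First two rows, first and last column, selected by position
block = data.iloc[:2, [0, 3]]

print(subset)
print(block)
```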

<b>Arithmetic and data alignment</b>

In [159]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

display(s1)
display(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

<font size=2>The internal data alignment introduces NA values in the indices that don’t overlap.</font>

In [160]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [161]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
display(df1)
display(df2)


Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [162]:
df1+df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [163]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
display(df1)
display(df2)


Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [164]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [165]:

df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
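The fill_value argument substitutes for a value that is missing in only one of the two operands. Where a cell is missing in both, the result is still NaN. The two small frames below are made up to show both cases side by side.

```python
import pandas as pd

x = pd.DataFrame({'a': [1.0, 2.0]}, index=[0, 1])
y = pd.DataFrame({'b': [10.0, 20.0]}, index=[1, 2])

plain = x + y                      # no overlapping columns, so every cell is NaN
filled = x.add(y, fill_value=0)    # one-sided gaps are treated as 0

print(plain)
print(filled)
```

Here cell (0, 'a') exists only in x and is kept, while cell (0, 'b') exists in neither frame and remains NaN. The other arithmetic methods (sub, mul, div and their reflected forms such as rsub) accept fill_value in the same way.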


<b>Function Application and Mapping</b>
<font size=2>

NumPy ufuncs (element-wise array methods) work fine with pandas objects:</font>

In [166]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

Unnamed: 0,b,d,e
Utah,-0.337689,0.966262,-0.600981
Ohio,-0.695704,-0.359518,-0.545284
Texas,-0.284484,0.201516,-0.447255
Oregon,1.668014,1.916706,0.770955


In [167]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.337689,0.966262,0.600981
Ohio,0.695704,0.359518,0.545284
Texas,0.284484,0.201516,0.447255
Oregon,1.668014,1.916706,0.770955


<font size=2>Another frequent operation is applying a function on 1D arrays to each column or row.
DataFrame’s apply method does exactly this:</font>

In [168]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    2.363718
d    2.276224
e    1.371937
dtype: float64

In [169]:
frame.min()

b   -0.695704
d   -0.359518
e   -0.600981
dtype: float64

<font size=2><b>Note:</b> Many of the most common array statistics (like sum and mean) are DataFrame methods,
so using apply is not necessary.</font>

<font size=2>The function passed to apply need not return a scalar value; it can also return a Series
with multiple values:</font>

In [170]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)

Unnamed: 0,b,d,e
min,-0.695704,-0.359518,-0.600981
max,1.668014,1.916706,0.770955
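By default apply works column by column. The sketch below uses a small frame with fixed values (so the results are reproducible, unlike the random frame above) to show apply with axis=1, which works row by row, and Series.map for element-wise transformations of a single column.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.5],
                      [0.5, 4.0, -1.0]],
                     columns=list('bde'), index=['Utah', 'Ohio'])

def spread(x):
    """Difference between the largest and smallest value."""
    return x.max() - x.min()

per_column = frame.apply(spread)           # one result per column
per_row = frame.apply(spread, axis=1)      # one result per row

# Element-wise formatting of a single column
formatted = frame['e'].map(lambda v: f'{v:.2f}')

print(per_column.tolist())
print(per_row.tolist())
print(formatted.tolist())
```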


<b>Sorting and Ranking</b>
<font size=2>

Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:

</font>

In [171]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [172]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [173]:
obj.sort_values()

d    0
a    1
b    2
c    3
dtype: int64

<font size=2>With a DataFrame, you can sort by index on either axis:</font>

In [174]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [175]:
frame.sort_index() 

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [176]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [177]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5
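A DataFrame can also be sorted by the values in one or more columns by passing by= to sort_values. The frame below is a made-up example, not one from the notebook.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_b = frame.sort_values(by='b')                     # single column
by_a_then_b = frame.sort_values(by=['a', 'b'])       # 'a' first, ties broken by 'b'
descending = frame.sort_values(by='b', ascending=False)

print(by_b.index.tolist())
print(by_a_then_b.index.tolist())
print(descending.index.tolist())
```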


Ranking is closely related to sorting, assigning ranks from one through the number of
valid data points in an array. It is similar to the indirect sort indices produced by
numpy.argsort, except that ties are broken according to a rule. By default, rank breaks
ties by assigning each group the mean rank:

In [178]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [179]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
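The tie-breaking rule is controlled by the method argument. The sketch below applies a few of the available options to the same values as above: 'first' ranks ties in the order they appear, 'min' gives every member of a tied group the lowest rank of the group, and ascending=False ranks from largest to smallest.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                               # default: mean rank for ties
first = obj.rank(method='first')                   # ties ranked by order of appearance
lowest = obj.rank(method='min')                    # ties share the lowest rank
desc_max = obj.rank(ascending=False, method='max') # largest value ranks first

print(first.tolist())
print(lowest.tolist())
print(desc_max.tolist())
```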

<b>Summarizing and Computing Descriptive Statistics</b>

<font size=2>pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods
that extract a single value (like the sum or mean) from a Series or a Series of values from
the rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data.</font>

<font size=2>Consider the following dataframe:</font>

In [180]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                   index=['a', 'b', 'c', 'd'],
                   columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [181]:
# the sum method
df.sum()

one    9.25
two   -5.80
dtype: float64

In [182]:
# Passing axis=1 sums over the rows instead:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [183]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
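One consequence of skipping missing data is visible in row c above: the sum of an all-NaN row is reported as 0.0. If that is not what you want, the min_count argument of sum offers a middle ground between the default and skipna=False. The sketch below reuses the same frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

default_sum = df.sum(axis=1)               # all-NaN row sums to 0.0
strict_sum = df.sum(axis=1, min_count=1)   # require at least one valid value

print(default_sum)
print(strict_sum)
```

With min_count=1, row a still sums to 1.4 (one valid value is enough), while row c becomes NaN.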

<font size=2>Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:
</font>

In [184]:
df.idxmax()

one    b
two    d
dtype: object

In [185]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


<font size=2>Other methods are neither reductions nor accumulations. `.describe()` is one
such example, producing multiple summary statistics in one shot:</font>


In [186]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


<font size=2>On non-numeric data, describe produces alternate summary statistics:</font>

In [187]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

In [188]:
# pip install yfinance

In [189]:
# import yfinance as yf
# import pandas as pd

# # Dictionary to hold stock data
# all_data = {}

# # Loop through tickers and retrieve data
# for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
#     all_data[ticker] = yf.download(ticker, start='2011-01-01', end='2020-01-01')

# # Create a DataFrame for 'Adj Close' prices
# price = pd.DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})

# # Create a DataFrame for 'Volume'
# volume = pd.DataFrame({tic: data['Volume'] for tic, data in all_data.items()})

# print(price.head())  # Example: print the first few rows of the price DataFrame
# print(volume.head()) # Example: print the first few rows of the volume DataFrame


In [190]:
# volume

In [191]:
# price

In [192]:
# returns = price.pct_change()
# returns.tail()

<font size=2>The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance:</font>

In [193]:
# returns.MSFT.corr(returns.IBM)

In [194]:
# returns.MSFT.cov(returns.IBM)

<font size=2>
DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:
</font>

In [195]:
# returns.corr()

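<font size=2>Using DataFrame’s corrwith method, you can compute pairwise correlations between a DataFrame’s columns or rows and another Series or DataFrame. Passing a DataFrame computes the correlations of matching column names. Here I
compute correlations of percent changes with volume:</font>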

In [196]:
# returns.corrwith(volume)
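Since the download cells above are commented out, here is a minimal sketch of the same calls on synthetic data. The tickers AAA, BBB and CCC, the random-walk prices and the random volumes are all invented for illustration and stand in for the real price and volume frames.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tickers = ['AAA', 'BBB', 'CCC']   # hypothetical tickers

# Synthetic prices: random walks starting near 100
price = pd.DataFrame(100 + rng.standard_normal((250, 3)).cumsum(axis=0),
                     columns=tickers)
volume = pd.DataFrame(rng.integers(1000, 5000, size=(250, 3)),
                      columns=tickers)

returns = price.pct_change().dropna()     # the first row has no previous price

pair_corr = returns['AAA'].corr(returns['BBB'])   # Series vs Series
pair_cov = returns['AAA'].cov(returns['BBB'])
corr_matrix = returns.corr()                      # full correlation matrix
cov_matrix = returns.cov()                        # full covariance matrix
vs_volume = returns.corrwith(volume)              # matched by column name

print(corr_matrix)
print(vs_volume)
```

The pairwise value from Series.corr matches the corresponding entry of the matrix returned by DataFrame.corr.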

<b>Unique Values, Value Counts, and Membership</b>

<font size=2>
Another class of related methods extracts information about the values contained in a one-dimensional Series.
</font>

In [197]:
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'grape', 'apple']
obj = pd.Series(data)
print(obj)

0     apple
1    banana
2     apple
3    orange
4    banana
5     grape
6     apple
dtype: object


<font size =2><b>Unique values</b></font>

In [198]:
obj.unique()

array(['apple', 'banana', 'orange', 'grape'], dtype=object)

<font size=2><b>Value Count</b> counts the frequency of the values</font>

In [199]:
obj.value_counts()

apple     3
banana    2
orange    1
grape     1
dtype: int64

In [200]:
mask = obj.isin(['apple','grape'])
mask

0     True
1    False
2     True
3    False
4    False
5     True
6     True
dtype: bool

In [201]:
obj[mask]

0    apple
2    apple
5    grape
6    apple
dtype: object

<font size=2>In some cases, you may want to compute a histogram on multiple related columns in a DataFrame.</font>

In [202]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [203]:
result = data.apply(pd.Series.value_counts).fillna(0)  # top-level pd.value_counts is deprecated
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


<b>Handling Missing Values</b>

<font size=2>pandas uses the floating point value NaN (Not a Number) to represent missing data in
both floating-point and non-floating-point arrays. It is just used as a sentinel that can
be easily detected:</font>

In [204]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [205]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
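Python's built-in None is also treated as missing in object arrays. The sketch below assigns None into the same kind of Series and checks it with isna and notna, which are the newer aliases of isnull and notnull.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data.iloc[0] = None       # None is treated as missing as well

missing = string_data.isna()     # same result as isnull()
n_present = string_data.notna().sum()

print(missing.tolist())
print(n_present)
```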

<b>Filtering Out Missing Data</b>

<font size=2>There are many options for filtering out missing data, but one of the most helpful methods is `dropna()`:</font>


In [206]:
from numpy import nan as NA

In [207]:
data = pd.Series([1,NA,3.5,NA,7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [208]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [209]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [210]:
data = pd.DataFrame([[1, 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])

In [211]:
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [212]:
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


<font size=2>Passing how='all' will only drop rows that are all NA:</font>

In [213]:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


<font size=2>Dropping columns in the same way is only a matter of passing axis=1:</font>

In [214]:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [218]:
data.dropna(axis=1,how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


<b>Filling in Missing Data</b>

In [222]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA; df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-1.720174,,
1,-1.619891,,
2,0.56176,,0.208501
3,-1.691531,,-1.571437
4,0.013016,-0.747895,-0.31574
5,-0.079835,0.384066,-0.944203
6,1.007579,-0.5728,0.298928


In [223]:
df.dropna(thresh=3)   # keep only rows with at least 3 non-NA values

Unnamed: 0,0,1,2
4,0.013016,-0.747895,-0.31574
5,-0.079835,0.384066,-0.944203
6,1.007579,-0.5728,0.298928
