# Introduction to pandas Data Structures
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: <b>Series and DataFrame</b>. While they are not a universal solution for
every problem, they provide a solid, easy-to-use basis for most applications.

# Pandas Series Object

<b>A Series</b> is the primary building block of pandas.

A Series represents a one-dimensional, labeled array built on the NumPy ndarray.

Like an array, a Series can hold zero or more values of any single data type.

# Creating Series
A Series can be created and initialized by passing either a <b>scalar value,
a NumPy ndarray, a Python list, or a Python dict</b> as the data parameter of
the Series constructor. Because data is the first parameter, its name does not
need to be specified when the value is passed first.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# create one item Series
s1 = pd.Series(2)
s1

0    2
dtype: int64

In [None]:
# create a series of multiple items from a list
s2 = pd.Series([1, 2, 3, 4, 5])
s2

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [None]:
# get the values in the Series
s2.values

array([1, 2, 3, 4, 5], dtype=int64)

In [None]:
# get the index of the Series
s2.index

In [None]:
# explicitly create an index
# index is alpha, not integer
s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c']) 
s3

a    1
b    2
c    3
dtype: int64

In [None]:
# look up the same element by label and by integer position
print(f"value by label s3['c'] is {s3['c']} and value by position s3.iloc[2] is {s3.iloc[2]}")

value by label s3['c'] is 3 and value by position s3.iloc[2] is 3


In [None]:
# create a Series that reuses an existing index
# the list supplies one value for each index label
s4 = pd.Series(["A","B","C","D","E"], index=s2.index)
s4

0    A
1    B
2    C
3    D
4    E
dtype: object
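
When the data argument is a scalar and an index is supplied, pandas repeats the scalar for every label. A minimal sketch of that case (the name `s_scalar` and the value 7 are illustrative only):

```python
import pandas as pd

# a scalar is copied to every label of the supplied index
s_scalar = pd.Series(7, index=['a', 'b', 'c'])
print(s_scalar)
```

Without an index, a scalar produces the one-item Series shown at the start of this section.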

In [None]:
# create Series from dict
s4 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
s4

a    1
b    2
c    3
d    4
dtype: int64

In [None]:
# create a Series from a NumPy array
s5 = pd.Series(np.array([22, 33, 44, 55, 66]))
s5

0    22
1    33
2    44
3    55
4    66
dtype: int32

<h3>Size, shape, uniqueness, and counts of values</h3>

In [None]:
# example Series that also contains a NaN (np.nan marks a missing value)
s = pd.Series([0, 1, 1, 2, 3, 4, 4, 5, 6, 7, np.nan])
s

0     0.0
1     1.0
2     1.0
3     2.0
4     3.0
5     4.0
6     4.0
7     5.0
8     6.0
9     7.0
10    NaN
dtype: float64

In [None]:
print(len(s))
print(s.size)
print(s.shape)
print(s.count())  # count() returns the number of non-null values
print(s.unique())
print(s.value_counts())

<h3>Peeking at data with heads, tails, and take</h3>

In [None]:
# first five
s.head()

In [None]:
# first three
s.head(n = 3) # s.head(3) is equivalent

In [None]:
# last five
s.tail()

In [None]:
# last 3
s.tail(n = 3) # equivalent to s.tail(3)

In [None]:
#The .take() method will return the rows in a series that correspond to the zero-based positions:

# only take specific items

s.take([9, 3, 9])

# Looking up values in Series

In [None]:
# single item lookup

print(s3)
s3['a']

In [None]:
# s3 has string labels, so an integer is not a label here.
# Older pandas versions fall back to a zero-based position lookup for s3[1];
# that fallback is deprecated in recent releases, so use .iloc for positions.
s3.iloc[1]

In [None]:
# multiple items
s3[['c', 'a']]

In [None]:
# series with an integer index, but not starting with 0
s5 = pd.Series([1, 2, 3], index=[2, 3, 4])  
s5

# label-based lookup versus position-based lookup

In [None]:
s5[2]  # 2 is treated as a label because the index holds integers
       # and one of its labels is 2

In [None]:
s5[0]   # raises KeyError: with an integer index, [] looks up labels only, and there is no label 0

In [None]:
s5.loc[2]  # loc also works on label based look up

In [None]:
s5.iloc[2]  # iloc always looks up by position, whatever the index labels are

In [None]:
# multiple items by label (loc)
s5.loc[[4, 3]]

In [None]:
s5[[0,2]]  # label lookup: raises KeyError because the label 0 is not in the index

In [None]:
s5.iloc[[0,2]]

In [None]:
s5.iloc[[0,2,3]]  # raises IndexError: position 3 is out of bounds for a three-item Series

# Alignment via index labels

A fundamental difference between a NumPy ndarray and a pandas Series is the
ability of a Series to automatically align data from another Series based on label
values before performing an operation.

In [None]:
s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s6

In [None]:
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])
s7

In [None]:
# add them
s6 + s7    # pandas first aligns the data by label, then performs the operation

<h3>NaN + number = NaN</h3> (NaN added to a number results in NaN)

<h3>number + NaN = NaN</h3> (a number added to NaN results in NaN)


In [None]:
s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s8

In [None]:
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})
s9

In [None]:
# NaN's result for a and e
# demonstrates alignment
s8 + s9

In [None]:
s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s10

In [None]:
s11 = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])
s11

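When the two Series objects are added (or any other operation is performed), every 'a'
label in one Series is paired with every 'a' label in the other, so the resulting Series
has four 'a' index labels (2 x 2), plus NaN entries for 'b' and 'c'.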

In [None]:
s10 + s11

![catesion.png](attachment:catesion.png)
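
The pairing of duplicate labels can be checked directly. The following is a small sketch with its own illustrative names (`left`, `right`, `total`); it counts the 'a' rows in the sum and lists their values:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
right = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])

# each 'a' on the left is combined with each 'a' on the right
total = left + right

print(total)
print("rows labelled 'a':", (total.index == 'a').sum())
print("rows that are NaN:", total.isna().sum())
```

The labels 'b' and 'c' each exist in only one Series, so their rows come out as NaN.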

# The special case of Not-A-Number (NaN)

In [None]:
# mean of numpy array values
nda = np.array([1, 2, 3, 4, 5])
nda.mean()

In [None]:
# mean of numpy array values with a NaN
nda = np.array([1, 2, 3, 4, np.nan])  # np.nan is the supported spelling; np.NaN was removed in NumPy 2.0
nda.mean()

In [None]:
# ignores NaN values
s = pd.Series(nda)      
s.mean()

In [None]:
# handle NaN values like NumPy
s.mean(skipna=False)

# Boolean Selection

In [None]:
# which rows have values that are > 5?
s = pd.Series(np.arange(0, 10))

s > 5

In [None]:
# select rows where values are > 5
logicalResults = s > 5
s[logicalResults]

In [None]:
# a little shorter version
s[s > 5]

In [None]:
# commented as it throws an exception
# s[s > 5 and s < 8]

# correct syntax
s[(s > 5) & (s < 8)]

In [None]:
pd.Series([True, False, False, True, True]).all(),pd.Series([True, False, False, True, True]).any()

In [None]:
(np.array([1,0,1,1])).sum()

In [None]:
# are all items >= 0?
(s >= 0).all()

In [None]:
s < 2

In [None]:
# any items < 2?
(s < 2).any()

In [None]:
# how many values < 2?
(s < 2).sum()

# Reindexing a Series

Reindexing in pandas is a process that makes the data in a Series or DataFrame
match a given set of labels. This is core to the functionality of pandas as it enables
label alignment across multiple objects, which may originally have different
indexing schemes.
This process of performing a reindex includes the following steps:
1. Reordering existing data to match a set of labels.
2. Inserting NaN markers where no data exists for a label.
3. Possibly, filling missing data for a label using some type of logic (defaulting
to adding NaN values).
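
The three steps above can be sketched with the `.reindex()` method, which is covered in more detail later in this section. The Series `prices` and its labels are made up for illustration:

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# 1. reorder existing data to match the requested labels
reordered = prices.reindex(['c', 'a', 'b'])

# 2. a label with no data ('z') receives NaN
with_gap = prices.reindex(['c', 'a', 'z'])

# 3. optionally supply a fill value instead of NaN
filled = prices.reindex(['c', 'a', 'z'], fill_value=0)

print(reordered, with_gap, filled, sep="\n\n")
```

Note that `with_gap` is converted to float so that it can hold NaN, while `filled` keeps its integer dtype.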

In [None]:
# sample series of five items
s = pd.Series(np.random.randn(5))
s

In [None]:
# change the index
s.index = ['a', 'b', 'c', 'd', 'e']
s

Let's examine a slightly more practical example. The following code concatenates
two Series objects resulting in duplicate index labels, which may not be desired in the
resulting Series:

In [None]:
# concat copies index values verbatim,
# so labels shared by both Series appear twice in the result
np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))
combined = pd.concat([s1, s2])
combined

In [None]:
# assign a fresh integer index to remove the duplicate labels
combined.index = np.arange(0, len(combined))
combined

Assigning to the .index property modifies the Series in place.

# reindex() method
Greater flexibility in creating a new index is provided using the .reindex() method.
An example of the flexibility of .reindex() over assigning the .index property
directly is that the list provided to .reindex() can be of a different length than the
number of rows in the Series:

In [None]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
print(s1)
# reindex with different number of labels
# results in dropped rows and/or NaN's


s2 = s1.reindex(['a', 'c', 'g'])
s2

Things to note:
    1. reindex() does not work in place; it returns a new Series and leaves the original unchanged.
    2. Any new label that does not exist in the original index is assigned NaN.
    3. Any original label that is left out of the new index is dropped from the
        new Series.

In [None]:
combined.reindex([9,5,3,4,0,1,2,6])  # not in place

In [None]:
combined   # the original index is unchanged

Reindexing is also useful when you want to align two Series to perform an
operation on matching elements from each Series but the two Series have
index labels that do not initially align.
The following example demonstrates this, where the first Series has indexes as
sequential integers, but the second has a string representation of what would be
the same values:


In [None]:
# different types for the same values of labels
# causes big trouble
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s1 + s2

All values are NaN. The operation tries to add the item in the first Series with the
integer label 0 to a matching item in the second, but the second Series only has the
string label '0', so no match is found and the result is NaN. The same mismatch occurs
for all six labels in the combined index (0, 1, 2, '0', '1', and '2').

<h5>Once this situation is identified:</h5>
it is straightforward to fix by converting the index labels of the second
Series to integers:

In [None]:
# reindex by casting the label types and we will get the desired result

s2.index = s2.index.values.astype(int)
s1 + s2

The default action of inserting <b>NaN</b> for a missing value during reindexing can
be overridden with the <b>fill_value</b> parameter of the reindex() method.

In [None]:
# work on a copy of the five-item Series indexed 'a' to 'e'
s2 = s.copy()
s2

In [None]:
# fill with 0 instead of NaN for the missing label 'f'
s2_reindexed = s2.reindex(['a', 'f'], fill_value=0)
s2_reindexed

<h3>ffill, bfill, & nearest</h3>

In [None]:
# create example to demonstrate fills
s3 = pd.Series(['red', 'green', 'blue', ], index=[0, 8, 10])
s3

In [None]:
# forward fill example
s3.reindex(np.arange(0,15), method='ffill')

In [None]:
# backwards fill example
s3.reindex(np.arange(0,7), method='bfill')

In [None]:
# nearest: fill each gap from the closest existing index label
s3.reindex(np.arange(0,10), method='nearest')

# Slicing a Series

In [None]:
# a Series to use for slicing
# using index labels not starting at 0 to demonstrate
# position based slicing

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))

s

In [None]:
print(s[0:6:2])  # [start:stop:step], by position

# equivalent to
s.iloc[[0, 2, 4]]

In [None]:
# first five by slicing, same as .head(5)
s[:5]
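
Position-based and label-based slices treat the end point differently. A short sketch, assuming a Series shaped like the one above (the names `s_demo`, `by_position`, and `by_label` are illustrative):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series(np.arange(100, 110), index=np.arange(10, 20))

# .iloc slices by position and excludes the end position
by_position = s_demo.iloc[0:3]   # positions 0, 1, 2

# .loc slices by label and includes the end label
by_label = s_demo.loc[10:13]     # labels 10, 11, 12, 13

print(by_position)
print(by_label)
```

Using `.iloc` and `.loc` explicitly avoids any doubt about whether an integer is meant as a position or as a label.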

# Missing Data in Series

NaN values mark data that is missing from a Series.


In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)
obj4

In [None]:
pd.isnull(obj4)  # equivalent to obj4.isnull()

In [None]:
pd.notnull(obj4)  # equivalent to obj4.notnull()

In [None]:
obj4.name = 'population'
obj4.index.name="state"
obj4

# The pandas DataFrame Object

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.). 

<h4>Creating a DataFrame from scratch</h4>

In [None]:
# create a DataFrame from a 2-d ndarray
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[10, 11, 12, 13], [20, 21, 22, 23]]))
df

# default row and columns indexes

Unnamed: 0,0,1,2,3
0,10,11,12,13
1,20,21,22,23


In [None]:
# create a DataFrame from a list of Series objects
# (each Series becomes one row)

df1 = pd.DataFrame([pd.Series(np.arange(10, 15)),
                    pd.Series(np.arange(15, 20))])
df1
# default row and column indexes

Unnamed: 0,0,1,2,3,4
0,10,11,12,13,14
1,15,16,17,18,19


In [None]:
# create a DataFrame with two Series objects
# and a dictionary
s1 = pd.Series(np.arange(1, 6, 1))

s2 = pd.Series(np.arange(6, 11, 1))

df2= pd.DataFrame({'boys': s1, 'girls': s2})
df2

Unnamed: 0,boys,girls
0,1,6
1,2,7
2,3,8
3,4,9
4,5,10


In [None]:
data = {'name':["Asad","Saad","Fahad", 'Ali'], 'age':[23,34,23,21], "grades":["A","B","A","B"]}
data = pd.DataFrame(data)
data

Unnamed: 0,name,age,grades
0,Asad,23,A
1,Saad,34,B
2,Fahad,23,A
3,Ali,21,B


In [None]:
# specify column names
df3 = pd.DataFrame(np.array([[10, 11], [20, 21]]),columns=['apples', 'oranges'])
df3

Unnamed: 0,apples,oranges
0,10,11
1,20,21


In [None]:
# create a DataFrame with named columns and rows

df4 = pd.DataFrame(np.array([[10, 11, 12, 13], [20, 21, 22, 23]]), 
                   index=['apples', 'oranges'],
                   columns=['Mon', 'Tue','Wed', 'Thu'])
df4

Unnamed: 0,Mon,Tue,Wed,Thu
apples,10,11,12,13
oranges,20,21,22,23


In [None]:
# demonstrate alignment during creation

s3 = pd.Series(np.arange(12, 14), index=[1, 2])

df5 = pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})
df5

Unnamed: 0,c1,c2,c3
0,1,6,
1,2,7,12.0
2,3,8,13.0
3,4,9,
4,5,10,


In [None]:
# Examples of creating data frames

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year':  [2000, 2001, 2002, 2001, 2002, 2003],
        'pop':   [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [None]:
pd.DataFrame(frame, columns=['year', 'state', 'pop'])  # not in place: returns a new DataFrame

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [None]:
frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [None]:
frame['pop']

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: pop, dtype: float64

In [None]:
# If you pass a column name that isn't contained in the dict ('debt' here),
# it will appear with missing values in the result:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                              index=['one', 'two', 'three', 'four','five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [None]:
frame2.debt = 100
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,100
two,2001,Ohio,1.7,100
three,2002,Ohio,3.6,100
four,2001,Nevada,2.4,100
five,2002,Nevada,2.9,100
six,2003,Nevada,3.2,100


In [None]:
frame2['debt']=np.arange(6)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4
six,2003,Nevada,3.2,5


In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [None]:
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


In [None]:
# adding more columns to the DataFrame

frame2['eastern'] = frame2.state == 'Ohio'  # boolean column: True where state is Ohio
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [None]:
frame2['greaterThan2']= frame2['pop'] > 2
frame2

Unnamed: 0,year,state,pop,debt,eastern,greaterThan2
one,2000,Ohio,1.5,,True,False
two,2001,Ohio,1.7,-1.2,True,False
three,2002,Ohio,3.6,,True,True
four,2001,Nevada,2.4,-1.5,False,True
five,2002,Nevada,2.9,-1.7,False,True
six,2003,Nevada,3.2,,False,True


In [None]:
del frame2['eastern']

In [None]:
frame2

Unnamed: 0,year,state,pop,debt,greaterThan2
one,2000,Ohio,1.5,,False
two,2001,Ohio,1.7,-1.2,False
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,True
five,2002,Nevada,2.9,-1.7,True
six,2003,Nevada,3.2,,True


In [None]:
data = {'name':["Asad","Saad","Fahad", 'Ali'], 
        'age':[23,34,23,21], 
        'AiforEveryONe':[89,78,90,98],
        'python':[78,89,87,89],
        'git': [90,98,87,86],
        'numpy':[98,87,98,99]       }
        
data = pd.DataFrame(data)
data

Unnamed: 0,name,age,AiforEveryONe,python,git,numpy
0,Asad,23,89,78,90,98
1,Saad,34,78,89,98,87
2,Fahad,23,90,87,87,98
3,Ali,21,98,89,86,99


In [None]:
data['Total'] = data['AiforEveryONe']+data['python']+data['git']+data['numpy']
data['percent'] = data['Total']/400*100


In [None]:
#Another common form of data is a nested dict of dicts:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       
       'Ohio':   {2000: 1.5, 2001: 1.7, 2002: 3.6}}
df3 =pd.DataFrame(pop)
df3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


<b>If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys
as the columns and the inner keys as the row indices</b>

In [None]:
df3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [None]:
df3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [None]:
pop1 =pd.DataFrame(pop, index=[2001, 2002, 2003])
pop1

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [None]:
pdata = {'Ohio': df3['Ohio'][:-1],
         
        'Nevada': df3['Nevada'][:2]
        }

pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [None]:
df3.index.name = 'year'
df3.columns.name = 'state'

df3

state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5

# Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels you use when
constructing a Series or DataFrame is internally converted to an Index:

In [None]:

obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [None]:
index =obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [None]:
index[1:]

Index(['b', 'c'], dtype='object')

In [None]:
index[1] = 'd'  # indices are immutable

TypeError: Index does not support mutable operations

In [None]:
labels = pd.Index(["a","b","c","d","e","f"])  # create an Index object directly;
                                # Index objects are immutable

In [None]:
labels

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [None]:
labels[0]="z"

TypeError: Index does not support mutable operations

In [None]:
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [None]:
frame.index=labels
frame

Unnamed: 0,state,year,pop
a,Ohio,2000,1.5
b,Ohio,2001,1.7
c,Ohio,2002,3.6
d,Nevada,2001,2.4
e,Nevada,2002,2.9
f,Nevada,2003,3.2


In [None]:
frame.index   # the row labels are an Index object

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [None]:
frame.columns  # the column labels are also an Index object

Index(['state', 'year', 'pop'], dtype='object')

# Essential Functionality

In [None]:
frame2['debt']=np.arange(6)
print("The frame is", end="\n\n")

print(frame2,end="\n\n")

print("The row indices are", end="\n\n")

print(frame2.index,end="\n\n")

print("The column indices are",end="\n\n")

print(frame2.columns)

The frame is

       year   state  pop  debt  greaterThan2
one    2000    Ohio  1.5     0         False
two    2001    Ohio  1.7     1         False
three  2002    Ohio  3.6     2          True
four   2001  Nevada  2.4     3          True
five   2002  Nevada  2.9     4          True
six    2003  Nevada  3.2     5          True

The row indices are

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

The column indices are

Index(['year', 'state', 'pop', 'debt', 'greaterThan2'], dtype='object')


In [None]:
# by default, reindex() reorders the rows; labels missing from the original index get NaN

reindex_frame = frame2.reindex(['five','two', 'three', 'six', 'four','one','seven'])


In [None]:
reindex_frame

Unnamed: 0,year,state,pop,debt,greaterThan2
five,2002.0,Nevada,2.9,4.0,True
two,2001.0,Ohio,1.7,1.0,False
three,2002.0,Ohio,3.6,2.0,True
six,2003.0,Nevada,3.2,5.0,True
four,2001.0,Nevada,2.4,3.0,True
one,2000.0,Ohio,1.5,0.0,False
seven,,,,,


The columns can be reindexed with the columns keyword:

In [None]:
reindex_frame = frame2.reindex(columns=['pop','year','imports', 'debt', 'state',"exports" ])

In [None]:
reindex_frame

Unnamed: 0,pop,year,imports,debt,state,exports
one,1.5,2000,,0,Ohio,
two,1.7,2001,,1,Ohio,
three,3.6,2002,,2,Ohio,
four,2.4,2001,,3,Nevada,
five,2.9,2002,,4,Nevada,
six,3.2,2003,,5,Nevada,


# Dropping Entries from an Axis


In [None]:
reindex_frame

Unnamed: 0,pop,year,imports,debt,state,exports
one,1.5,2000,,0,Ohio,
two,1.7,2001,,1,Ohio,
three,3.6,2002,,2,Ohio,
four,2.4,2001,,3,Nevada,
five,2.9,2002,,4,Nevada,
six,3.2,2003,,5,Nevada,


In [None]:
row_dropped_frame = reindex_frame.drop(['three','six'])   # returns a new DataFrame; not in place
                        # drop() removes row labels by default (axis=0)
row_dropped_frame

Unnamed: 0,pop,year,imports,debt,state,exports
one,1.5,2000,,0,Ohio,
two,1.7,2001,,1,Ohio,
four,2.4,2001,,3,Nevada,
five,2.9,2002,,4,Nevada,


In [None]:
col_dropped_frame = reindex_frame.drop(['imports','exports'],axis=1)
col_dropped_frame

Unnamed: 0,pop,year,debt,state
one,1.5,2000,0,Ohio
two,1.7,2001,1,Ohio
three,3.6,2002,2,Ohio
four,2.4,2001,3,Nevada
five,2.9,2002,4,Nevada
six,3.2,2003,5,Nevada


# Another Example:

In [None]:
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({
    'http_status': [200, 200, 404, 404, 301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
    index=index)
df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


Create a new index and reindex the dataframe.
By default values in the new index that do not have corresponding records in the dataframe are assigned ``NaN``.



In [None]:
new_index= ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10','Chrome']
df.reindex(new_index)

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02


We can fill in the missing values by passing a value to the keyword ``fill_value``. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
``method`` to fill the ``NaN`` values.

In [None]:
df.reindex(new_index, fill_value=0)

Unnamed: 0,http_status,response_time
Safari,404,0.07
Iceweasel,0,0.0
Comodo Dragon,0,0.0
IE10,404,0.08
Chrome,200,0.02


In [None]:
df.reindex(new_index, fill_value='missing')

Unnamed: 0,http_status,response_time
Safari,404,0.07
Iceweasel,missing,missing
Comodo Dragon,missing,missing
IE10,404,0.08
Chrome,200,0.02


In [None]:
#We can also reindex the columns.

df.reindex(columns=['http_status', 'user_agent'])

Unnamed: 0,http_status,user_agent
Firefox,200,
Chrome,200,
Safari,404,
IE10,404,
Konqueror,301,


In [None]:
# Or we can use "axis-style" keyword arguments
df.reindex(['http_status', 'user_agent'], axis="columns")

Unnamed: 0,http_status,user_agent
Firefox,200,
Chrome,200,
Safari,404,
IE10,404,
Konqueror,301,


To further illustrate the filling functionality in
``reindex``, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates)

In [None]:
date_index = pd.date_range('1/1/2010', periods=6, freq='D')

df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},index=date_index)
df2

Unnamed: 0,prices
2010-01-01,100.0
2010-01-02,101.0
2010-01-03,
2010-01-04,100.0
2010-01-05,89.0
2010-01-06,88.0


Suppose we decide to expand the dataframe to cover a wider
date range.

In [None]:
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
df2.reindex(date_index2)

Unnamed: 0,prices
2009-12-29,
2009-12-30,
2009-12-31,
2010-01-01,100.0
2010-01-02,101.0
2010-01-03,
2010-01-04,100.0
2010-01-05,89.0
2010-01-06,88.0
2010-01-07,


The index entries that did not have a value in the original data frame
(for example, '2009-12-29') are by default filled with ``NaN``.
If desired, we can fill in the missing values using one of several
options.

For example, to back-propagate the last valid value to fill the ``NaN``
values, pass ``bfill`` as an argument to the ``method`` keyword.

In [None]:
df2.reindex(date_index2, method='bfill')

Unnamed: 0,prices
2009-12-29,100.0
2009-12-30,100.0
2009-12-31,100.0
2010-01-01,100.0
2010-01-02,101.0
2010-01-03,
2010-01-04,100.0
2010-01-05,89.0
2010-01-06,88.0
2010-01-07,


Please note that the ``NaN`` value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the ``NaN`` values present
in the original dataframe, use the ``fillna()`` method.
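
As a hedged sketch of that last point, the snippet below rebuilds the price data under illustrative names (`prices`, `expanded`, `patched`) and applies `fillna()` after reindexing. The replacement value 0 is only a placeholder; a real analysis would pick a value that makes sense for the data.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2010', periods=6, freq='D')
prices = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=dates)

# reindex with bfill only fills rows whose labels are new to the index
wider = pd.date_range('12/29/2009', periods=10, freq='D')
expanded = prices.reindex(wider, method='bfill')

# fillna() replaces every remaining NaN, including the one
# that was already present in the original data
patched = expanded.fillna(0)

print(expanded)
print(patched)
```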


# Indexing, Selection, and Filtering

In [None]:
data = pd.DataFrame(np.arange(40).reshape((10, 4)),
    index=['Ohio', 'Colorado', 'Washington','Nebraska','Utah', 'New York','California', 'Texas', 'Georgia', 'Alaska'],
    columns=['Jan', 'Feb', 'Mar', 'Apr'])
data


In [None]:
# getting a single col
data['Jan']


In [None]:
#getting multiple cols
data[['Jan', 'Apr']]

In [None]:
# integer based
data[:2]  # slice rows by position: start at 0 and take two rows

In [None]:
# label based
data["Utah":"Texas"]  # slice rows by label: from "Utah" through "Texas", both included

In [None]:
data[2:6,0:2]   # raises an error: plain [] cannot slice rows and columns
                # together by position; use .iloc instead

In [None]:
data["Utah":"Texas", "Jan":'Mar']    # raises an error: plain [] cannot slice rows and columns
                                 # together by label; use .loc instead

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.

<b>loc</b> is primarily label-based indexing. Integers may be used, but they are interpreted as labels.

<b>iloc</b> is primarily integer-position-based indexing.
To select a subset of rows and columns from our DataFrame, we can use either indexer.

In [None]:
# use loc (label based); both end labels are included

data.loc["Utah":"Texas", "Jan":'Mar']


In [None]:
# use iloc (integer based); the end position is excluded

data.iloc[2:6,0:2] 
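
To make the inclusive versus exclusive end points concrete, here is a small self-contained sketch. The three-row frame `table` is made up for illustration; it selects the same block once by label and once by position:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['Ohio', 'Utah', 'Texas'],
                     columns=['Jan', 'Feb', 'Mar', 'Apr'])

# label slices include both end labels
by_label = table.loc['Ohio':'Utah', 'Jan':'Feb']

# position slices exclude the end position
by_position = table.iloc[0:2, 0:2]

print(by_label)
print(by_label.equals(by_position))
```

Both selections cover the rows Ohio and Utah and the columns Jan and Feb, so the comparison prints True.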

In [None]:
a = pd.DataFrame({"p":[2,4,6]})
a.rdiv(2)   # reversed division: 2 / a, i.e. 2/2, 2/4, 2/6

In [None]:
# boolean mask: which rows have a March value greater than 15?

data['Mar'] > 15

In [None]:
# keep only the rows whose March value is greater than 20
data[data['Mar'] > 20]

In [None]:
# set every value below 5 to 0 (modifies data in place)
data[data < 5] = 0
data

# Function Application and Mapping

In [None]:
frame = np.abs(
               pd.DataFrame(
                      np.random.randn(4, 3),
                      columns=list('bde'),
                      index=['Utah', 'Ohio', 'Texas', 'Oregon']))
frame

In [None]:
f = lambda x: x.max() - x.min()  # range (max minus min) of the Series passed in

In [None]:
frame.apply(f,axis='rows')  # axis 0 ('index' or 'rows'): f receives each column, giving one result per column

In [None]:
frame.apply(f,axis='columns')  # axis 1 ('columns'): f receives each row, giving one result per row
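
Because the frame above holds random numbers, the effect of the axis argument is easier to verify on fixed values. A minimal sketch with made-up data (`scores` and `spread` are illustrative names):

```python
import pandas as pd

scores = pd.DataFrame({'b': [1.0, 4.0], 'd': [2.0, 8.0]},
                      index=['Utah', 'Ohio'])
spread = lambda x: x.max() - x.min()

# axis=0: the function receives each column, so the result is indexed by column name
per_column = scores.apply(spread, axis=0)

# axis=1: the function receives each row, so the result is indexed by row label
per_row = scores.apply(spread, axis=1)

print(per_column)
print(per_row)
```

The axis names the direction the function travels along, which is why `axis=0` collapses the rows and leaves one value per column.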

# What are Lambda Functions?

A <b><em>lambda</em></b> function is a small function containing a single expression. Lambda functions can also act as anonymous functions, since they do not require a name. They are very helpful when we have to perform small tasks with less code.

Lambda functions are handy and used in many programming languages but we’ll be focusing on using them in Python here. In Python, lambda functions have the following syntax:


![lambda.jpg](attachment:lambda.jpg)
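
As a text companion to the image, a minimal sketch of the general form `lambda arguments: expression`. The names `add` and `square` are illustrative only:

```python
# general form: lambda arguments: expression
add = lambda a, b: a + b      # two arguments, one expression
square = lambda x: x * x      # one argument

print(add(2, 3))
print(square(4))
```

The expression is evaluated and returned automatically, so no `return` keyword appears inside a lambda.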

# IIFEs using lambda functions
IIFEs are <i>Immediately Invoked Function Expressions</i>. These are functions that are executed as soon as they are created. IIFEs require no explicit call to invoke the function. In Python, IIFEs can be created using the lambda function.

Here we create an IIFE that returns the cube of a number:

In [None]:
(lambda x: x*x*x)(10)


# Application of Lambda Functions with Different Functions

We create a small dataset that contains information about a family of 5 people with their id, names, ages, and income per month. This dataframe is used to show how to apply lambda functions with different functions on a dataframe in Python.

In [None]:
df=pd.DataFrame({
                'id':[1,2,3,4,5],
                'name':['Asad','Saad','Numi','Roman','Maria'],
                'age':[20,25,15,10,30],
                'income':[4000,7000,200,0,10000]})
df

# Application of Lambda with Apply

Let’s say we have got an error in the age variable. We recorded ages with a difference of 3 years. So, to remove this error from the Pandas dataframe, we have to add three years to every person’s age. We can do this with the <b>apply() function</b> in Pandas.

The <b>apply() function</b> in Pandas calls the lambda function on every row or column of the dataframe and returns the results as a new object, so we assign the result back to the column:

In [None]:
df['age']=df.apply(lambda x: x['age']+3,axis='columns')  # apply on the whole frame, row by row

In [None]:
df

In [None]:
# alternative to the cell above: apply on a single Series
# (running both cells adds 3 years twice)
df['age']=df['age'].apply(lambda x: x+3)

In [None]:
df

# Application of Lambda with Filter

Now, let’s see how many of these people are above the age of 18.

We can do this using the <b>filter() function</b>. 

Python's built-in <b>filter() function</b> takes a lambda function and an iterable (here, a Pandas series), applies the lambda function to each element, and keeps only the elements for which it returns True.



In [None]:
list(filter(lambda x: x>18, df['age']))
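
For comparison, the same selection can be written with a boolean mask, as in the Boolean Selection section earlier. This is a sketch on a stand-alone Series of made-up ages (`ages` is an illustrative name), not a replacement for the cell above:

```python
import pandas as pd

ages = pd.Series([23, 28, 18, 13, 33])

# built-in filter() with a lambda returns the matching values only
adults_builtin = list(filter(lambda x: x > 18, ages))

# a boolean mask returns a Series and keeps the index labels
adults_mask = ages[ages > 18]

print(adults_builtin)
print(adults_mask)
```

The mask version keeps the original index, which helps when the matching rows need to be traced back to the dataframe.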


# Application of Lambda with Map

You’ll be able to relate to the next statement. 🙂 It’s performance appraisal time and the income of all the employees gets increased by 20%. This means we have to increase the salary of each person by 20% in our Pandas dataframe.

We can do this using Python's built-in map() function, which applies a function to every element of a sequence and yields the results. It is very helpful when we have to replace the values of a series with other values.

In [None]:
df['income']=list(map(lambda x: int(x+x*0.2),df['income']))


In [None]:
df


In [None]:
df['income2'] = df['income'].apply(lambda x: x+x*.2)


In [None]:
df
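
A pandas Series also has its own `.map()` method, which accepts either a function or a dict of substitutions. The sketch below uses made-up values; the names `grades`, `described`, and `doubled` are illustrative:

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A'])

# a dict substitutes each value with its corresponding entry
described = grades.map({'A': 'excellent', 'B': 'good'})

# a function is applied to each element
doubled = pd.Series([1, 2, 3]).map(lambda x: x * 2)

print(described)
print(doubled)
```

Unlike the built-in map(), `Series.map()` returns a Series, so the index is preserved and no `list()` call is needed.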

# Conditional Statements using Lambda Functions

Lambda functions also support conditional statements, such as if..else. This makes lambda functions very powerful.

Let’s say in the family dataframe we have to categorize people into ‘Adult’ or ‘Child’. For this, we can simply apply the lambda function to our dataframe:

In [None]:
df['category']=df['age'].apply(lambda x: 'Adult' if x>=18 else 'Child')


In [None]:
df


# Lambda with Reduce
Now, let’s see the total income of the family. To calculate this, we can use the reduce() function in Python. It applies a two-argument function cumulatively to the elements of a sequence, reducing the sequence to a single value. The reduce() function is defined in the ‘functools’ module.

For using the reduce() function, we have to import the functools module first:

In [None]:
import functools
functools.reduce(lambda a,b: a+b,df['income'])
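
As a cross-check, the reduction can be compared with the built-in `.sum()` method of a Series. The sketch below uses a stand-alone Series holding the original income figures; `income`, `total_reduce`, and `total_sum` are illustrative names:

```python
import functools
import pandas as pd

income = pd.Series([4000, 7000, 200, 0, 10000])

# reduce() folds the values pairwise: ((((4000 + 7000) + 200) + 0) + 10000)
total_reduce = functools.reduce(lambda a, b: a + b, income)

# Series.sum() computes the same total directly
total_sum = income.sum()

print(total_reduce, total_sum)
```

For plain totals `.sum()` is the shorter choice; `reduce()` is more useful when the combining step is something other than addition.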


# Summarizing and Computing Descriptive Statistics

In [None]:
# exercise: try this yourself
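
As a possible starting point for the exercise, here is a hedged sketch of a few summary methods on a small made-up frame. The name `stats_df` and all of its values are illustrative, not part of the original notebook:

```python
import numpy as np
import pandas as pd

stats_df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                         'two': [np.nan, -4.5, np.nan, -1.3]},
                        index=['a', 'b', 'c', 'd'])

col_sums = stats_df.sum()           # per column; NaN values are skipped by default
row_means = stats_df.mean(axis=1)   # per row; a row of only NaN yields NaN
summary = stats_df.describe()       # count, mean, std, min, quartiles, max

print(col_sums)
print(row_means)
print(summary)
```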

# Correlation and Covariance

Study link: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/

Example:
    

In [None]:
# install the package before importing it
!pip install pandas_datareader
import pandas_datareader.data as web



In [None]:
# dictionary comprehension: one DataFrame per ticker (requires network access)

all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [None]:
all_data

In [None]:
all_data['AAPL']
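
The download above depends on an external data service, so here is an offline sketch of the correlation and covariance calls on made-up numbers. The frame `returns` and its columns are illustrative and are not stock data:

```python
import pandas as pd

returns = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                        'y': [2.0, 4.0, 6.0, 8.0],    # y rises with x
                        'z': [4.0, 3.0, 2.0, 1.0]})   # z falls as x rises

corr_xy = returns['x'].corr(returns['y'])   # close to +1
corr_xz = returns['x'].corr(returns['z'])   # close to -1
cov_xy = returns['x'].cov(returns['y'])     # sample covariance (divides by n - 1)

corr_matrix = returns.corr()                # pairwise correlations of all columns

print(corr_xy, corr_xz, cov_xy)
print(corr_matrix)
```

Correlation is scaled to the range -1 to 1, while covariance keeps the units of the data, which is why `cov_xy` is not 1 even though the two columns move together perfectly.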