#### <b>Pandas</b>
<font size=2>
Pandas provides high-level data structures and manipulation tools designed to make data
analysis fast and easy in Python. It is built on top of NumPy, which makes it easy to
use in NumPy-centric applications.

Pandas is widely used in data science, machine learning, and data analysis due to its ability to handle large datasets and perform complex data operations efficiently.
</font>

In [1]:
import pandas as pd
import numpy as np      # NumPy is used later for np.exp

<font size=3>
The two main data structures in Pandas
</font>
<font size=2>

- <b>DataFrame:</b> A two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table.
- <b>Series:</b> A one-dimensional, labeled array that can hold any data type. It is like a single column in a DataFrame.

Let's get started with Series.

</font>
<br>
<b>Series</b>
<font size=2>The simplest Series is formed from just an array of data:</font>

In [2]:
s1 = pd.Series([4, 7, -5, 3])
s1

0    4
1    7
2   -5
3    3
dtype: int64

<font size=2>
The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its values
and index attributes, respectively:
</font>

In [3]:
print(s1.values)
s1.index

[ 4  7 -5  3]


RangeIndex(start=0, stop=4, step=1)

<font size=2>Since the index attribute above returns a RangeIndex object, we can loop over it:</font>

In [4]:
for i in s1.index:
    print(i, end=" ")

0 1 2 3 

We can also assign our own index so that each element is identified by a label:

In [5]:
s2 = pd.Series([4, 7, -5, 3], index=['a','b','c','d'])
s2

a    4
b    7
c   -5
d    3
dtype: int64

<font size=2>
Unlike a regular NumPy array, a Series lets you use the index labels when selecting
single values or a set of values:
</font>

In [6]:
print("Accessing element: ",s2['a'],"\n") # access a single element by its index label
s2['d'] = 6     # assign a new value by index label
print("Accessing elements using multiple indexes:")
s2[['c', 'a', 'd']] # access multiple elements using a
                    # list of index labels

Accessing element:  4 

Accessing elements using multiple indexes:


c   -5
a    4
d    6
dtype: int64

<font size=2>
NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

</font>

In [7]:
s2[s2 > 0]

a    4
b    7
d    6
dtype: int64

In [8]:
s2 * 2

a     8
b    14
c   -10
d    12
dtype: int64

In [9]:
np.exp(s2)

a      54.598150
b    1096.633158
c       0.006738
d     403.428793
dtype: float64

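<font size=2>A Series can also be thought of as a fixed-length, ordered dict, since it maps index labels to data values. It can be substituted into many functions that expect a
dict:</font>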

In [10]:
print('b' in s2)
print('e' in s2)

True
False
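The dict-like behaviour goes a little further than the `in` operator. The sketch below is only an illustration; the Series `s` is rebuilt here with the same labels and values as `s2`, and the variable names are chosen for this example.

```python
import pandas as pd

# Same labels and values as s2 above (after 'd' was reassigned to 6)
s = pd.Series([4, 7, -5, 6], index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels, not at the values
has_b = 'b' in s          # 'b' is a label
has_seven = 7 in s        # 7 is a value, not a label

# dict-style helpers: get() with a default, keys() and items()
missing = s.get('e', 0)   # label 'e' is absent, so the default is returned
labels = list(s.keys())   # keys() returns the index
pairs = list(s.items())   # (label, value) pairs

# Convert to a plain Python dict
as_dict = s.to_dict()
print(has_b, has_seven, missing, labels, as_dict)
```

Note that `in` checks the labels only, so a value such as 7 is not found this way.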


<font size=2>You can also create a Series from a dictionary; the keys become the index:</font>

In [11]:
sdata = {'Ohio': 35000,
         'Texas': 71000,
         'Oregon': 16000,
         'Utah': 5000}
s3 = pd.Series(sdata)
s3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
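When building a Series from a dict you can also pass an explicit index to control which keys are kept and in what order. This is a small sketch; the `states` list (including 'California', which is not in the dict) is made up for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Hypothetical list of labels: 'California' has no entry in sdata,
# and 'Utah' is deliberately left out.
states = ['California', 'Ohio', 'Oregon', 'Texas']

s4 = pd.Series(sdata, index=states)
print(s4)
```

Labels with no matching key get NaN, keys that are not listed are dropped, and the values become floats because NaN is a floating-point value.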

<font size=2>What happens if some values are missing from the data?</font>

In [12]:
data = {
    'A': 10,
    'B': None,
    'C': 30,
    'D': None,
    'E': 50
}
null_data = pd.Series(data)
print(null_data)

A    10.0
B     NaN
C    30.0
D     NaN
E    50.0
dtype: float64


<font size=2>Above, the entries that were <b>None</b> in the dictionary appear as <b>NaN</b> in the Series, which is how pandas marks missing values. The dtype is float64 because NaN is a floating-point value.

To check whether a Series contains null values, use the isnull() method:
</font>

In [13]:
null_data.isnull()

A    False
B     True
C    False
D     True
E    False
dtype: bool

<font size=2>The <b>sum()</b> method can be chained after <b>isnull()</b> to count the null values:</font>

In [14]:
print("Null values in null_data: ",null_data.isnull().sum())

Null values in null_data:  2
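Once missing values have been found, they are usually either dropped or filled. The following is a minimal sketch using the same data as `null_data`; filling with 0 is just an example choice, not a recommendation.

```python
import pandas as pd

null_data = pd.Series({'A': 10, 'B': None, 'C': 30, 'D': None, 'E': 50})

# Keep only the entries that are present (boolean filtering with notnull)
present = null_data[null_data.notnull()]

# dropna() gives the same result in one call
dropped = null_data.dropna()

# fillna() replaces missing entries with a chosen value (0 here)
filled = null_data.fillna(0)

print(dropped)
print(filled)
```

Both dropna() and fillna() return new Series, so `null_data` itself still contains its two NaN entries.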


<font size=2>Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:</font>

In [15]:
null_data.name = "rank"
null_data.index.name = "grade"

null_data

grade
A    10.0
B     NaN
C    30.0
D     NaN
E    50.0
Name: rank, dtype: float64
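One place where these names show up is when a Series is converted to a DataFrame. The sketch below assumes a small Series similar to `null_data`; the variable names are illustrative.

```python
import pandas as pd

ranks = pd.Series({'A': 10, 'B': None, 'C': 30}, name='rank')
ranks.index.name = 'grade'

# to_frame() uses the Series name as the column label
as_frame = ranks.to_frame()

# reset_index() turns the named index into an ordinary column called 'grade'
flat = ranks.reset_index()

print(as_frame)
print(flat)
```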

<b>DataFrame</b>
<font size=2>

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.).
 
The DataFrame has both a row and column index; it can be
thought of as a dict of Series that all share the same index.
</font>
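To make the "dict of Series" picture concrete, here is a small sketch that builds a DataFrame from two Series with the same index. The data and the names `year` and `pop` are chosen only for illustration.

```python
import pandas as pd

labels = ['one', 'two', 'three']
year = pd.Series([2000, 2001, 2002], index=labels)
pop = pd.Series([1.5, 1.7, 3.6], index=labels)

# Each dict key becomes a column; the shared index becomes the row index
frame = pd.DataFrame({'year': year, 'pop': pop})

# Pulling a column back out gives a Series with the frame's row index
col = frame['pop']
print(type(col).__name__)
print(col.index.equals(frame.index))

# Each column keeps its own dtype
print(frame.dtypes)
```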

Creation
<font size=2>

There are numerous ways to construct a DataFrame, though one of the most common
is from a dict of equal-length lists or NumPy arrays:</font>


In [16]:
df = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(df)

frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


<font size=2>
If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:
</font>

In [17]:
pd.DataFrame(df, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


<font size=2>
As with Series, if you pass a column that isn’t contained in data, it will appear with NA
values in the result:
</font>

In [18]:
frame2 = pd.DataFrame(df, 
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


<font size=2>
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:
</font>

In [19]:
display(frame2['year'])
display(frame2.state)

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

<font size=2>Rows can also be accessed by label or by position using the <b>.loc</b> and <b>.iloc</b> indexers. Both are used with square brackets rather than called like methods.

- .loc selects rows by their index labels (here 'one', 'two', ...)

- .iloc selects rows by integer position (0, 1, 2, ...)
</font>

In [39]:
frame2.iloc[0]

year     2000
state    Ohio
pop       1.5
debt      NaN
Name: one, dtype: object

In [41]:
frame2.loc['four']

year       2001
state    Nevada
pop         2.4
debt        NaN
Name: four, dtype: object
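Both indexers also accept slices and a second argument for columns. The sketch below rebuilds a frame like `frame2` (without the debt column) so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

# .loc slices by label and includes BOTH endpoints ('two' through 'four')
by_label = frame2.loc['two':'four', ['state', 'pop']]

# .iloc slices by position and excludes the stop (rows 1, 2 and 3);
# columns 0 and 2 are 'state' and 'pop'
by_pos = frame2.iloc[1:4, [0, 2]]

# A single cell: row label first, then column label
single = frame2.loc['four', 'pop']

print(by_label)
print(by_label.equals(by_pos))
print(single)
```

The difference in slice endpoints is easy to miss: label slices with .loc are inclusive, while position slices with .iloc follow the usual Python convention.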

<font size=2>Columns can be modified by assignment. A scalar value is broadcast to every row:</font>

In [42]:
frame2.debt = 10
frame2

       year   state  pop  debt
one    2000    Ohio  1.5    10
two    2001    Ohio  1.7    10
three  2002    Ohio  3.6    10
four   2001  Nevada  2.4    10
five   2002  Nevada  2.9    10


<font size=2>When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will instead be aligned to the
DataFrame’s index, with missing values inserted for any labels the Series lacks:</font>

In [43]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
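The two assignment rules can be checked side by side. This is a rough sketch on a smaller, made-up frame: a list of the wrong length raises an error, while a Series is aligned by label and any extra labels are ignored.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'year': [2000, 2001, 2001]},
                     index=['one', 'two', 'three'])

# A list must have exactly one value per row, otherwise pandas raises ValueError
error = None
try:
    frame['debt'] = [1.0, 2.0]          # 2 values for 3 rows
except ValueError as exc:
    error = str(exc)
print(error)

# A Series is matched on index labels; rows without a match become NaN
frame['debt'] = pd.Series([-1.2], index=['two'])

# Labels that are not in the frame's index ('zzz' here) are simply ignored
frame['extra'] = pd.Series([9.9, 7.7], index=['three', 'zzz'])
print(frame)
```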


<font size=2>Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:</font>

In [44]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2


       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False


In [46]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
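One caveat worth seeing in code: a new column has to be created with bracket notation, because attribute-style assignment of a new name only sets a Python attribute on the object. The sketch below uses a small made-up frame and also shows drop() as an alternative to del that returns a new frame.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Attribute assignment with a NEW name does not create a column.
# pandas emits a UserWarning about this, silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.eastern = frame['state'] == 'Ohio'
attr_made_column = 'eastern' in frame.columns

# Bracket notation does create the column
frame['eastern'] = frame['state'] == 'Ohio'

# drop() returns a new DataFrame and leaves the original untouched
trimmed = frame.drop(columns=['eastern'])

print(attr_made_column)
print(list(frame.columns))
print(list(trimmed.columns))
```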

<font size=2><b>Another common form of data is a nested dict of dicts format:</b></font>

In [48]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

<font size=2>If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
keys as the row indices:</font>

In [49]:
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
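Two small variations on the nested-dict construction are sketched below: transposing the result so the outer keys become rows, and passing an explicit index to fix the row order. The variable names are illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

# .T swaps rows and columns: states become the row labels
transposed = frame3.T

# An explicit index controls which inner keys are used and in what order
ordered = pd.DataFrame(pop, index=[2000, 2001, 2002])

print(transposed)
print(ordered)
```

Nevada has no entry for 2000, so that cell is NaN in both frames.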


<font size=2>Like Series, the values attribute returns the data contained in the DataFrame as a 2D
ndarray:</font>

In [50]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [51]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
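The two outputs above differ in dtype because a 2D array holds a single dtype, so pandas picks one that can accommodate every column. The sketch below shows this with two small made-up frames, using to_numpy(), which returns the same kind of array as the values attribute.

```python
import pandas as pd

numeric = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [1.7, 3.6]})
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

numeric_arr = numeric.to_numpy()   # all columns are floats
mixed_arr = mixed.to_numpy()       # integers and strings together

print(numeric_arr.dtype)
print(mixed_arr.dtype)
```

When the columns mix numbers and strings, the result falls back to the generic object dtype.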

<b>Index Objects</b>

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [54]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [56]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [57]:
index[1:]

Index(['b', 'c'], dtype='object')

<font size=2>Index objects are immutable, so their labels cannot be modified in place.

In addition to being array-like, an Index also functions as a fixed-size set:
</font>

In [58]:
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [59]:
print('Ohio' in frame3.columns)
print(2003 in frame3.index)

True
False
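Both properties of an Index, immutability and set-like behaviour, can be sketched briefly. The labels in `other` are made up for illustration.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment is not allowed on an Index and raises TypeError
error = None
try:
    index[1] = 'd'
except TypeError as exc:
    error = str(exc)
print(error)

# Set-style operations return new Index objects
other = pd.Index(['b', 'c', 'd'])   # hypothetical second set of labels
common = index.intersection(other)
combined = index.union(other)
only_left = index.difference(other)

print(list(common), list(combined), list(only_left))
```

Because an Index cannot change in place, it is safe to share the same Index object between several Series and DataFrames.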
