<a href="https://colab.research.google.com/github/Muhammad-Usama-07/Data-Science-Journey/blob/main/Pandas/PandasWork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is pandas in Python?

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

# Working with Series in Pandas

## Pandas Series Object

A Series is the primary building block of pandas.

A Series represents a one-dimensional, labeled array built on the NumPy ndarray.
Like an array, a Series can hold zero or more values of any single data type.

### Part-(a). Creating Series

A Series can be created and initialized by passing a scalar value, a NumPy ndarray, a Python list, or a Python dict as the data parameter of the Series constructor. data is the first parameter, so it can be passed positionally without naming it.

In [None]:
import numpy as np 
import pandas as pd 

#### 1. Create one item Series

In [None]:
s1 = pd.Series(3)
s1

0    3
dtype: int64

#### 2. create a series of multiple items using a list

In [None]:
ser_mul = pd.Series([1,2,3,4,5])
ser_mul

0    1
1    2
2    3
3    4
4    5
dtype: int64

#### 3. getting the values in the Series

In [None]:
ser_mul.values

array([1, 2, 3, 4, 5], dtype=int64)

#### 4. getting the index of the Series

In [None]:
ser_mul.index

RangeIndex(start=0, stop=5, step=1)

#### 5. Explicitly creating an index (string labels instead of integers)

In [None]:
ser2_str_index = pd.Series([1,2,3,4], index=['a','b','c','d'])
ser2_str_index

a    1
b    2
c    3
d    4
dtype: int64

#### 6. Accessing a value by label, accessing a value by integer position

In [None]:
# [] with a string looks up the label; iloc looks up the integer position explicitly
print(f"By label, the value is: {ser2_str_index['d']}")
print(f"By position, the value is: {ser2_str_index.iloc[3]}")

By label, the value is: 4
By position, the value is: 4


#### 7. Create a Series that reuses an existing index (a scalar value would be copied to each index label)

In [None]:
ser3_str = pd.Series(['A','B','C','D'], index = ser2_str_index.index)
ser3_str

a    A
b    B
c    C
d    D
dtype: object
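
The heading mentions scalar values, while the cell above passes a list. As a minimal sketch of the scalar case, assuming a small index made up for this illustration (the names labels and filled are hypothetical), a single value passed together with an index is repeated once per label:

```python
import pandas as pd

# Hypothetical index labels, chosen only for this illustration
labels = pd.Index(['a', 'b', 'c', 'd'])

# A scalar passed together with an index is repeated once per label
filled = pd.Series(7, index=labels)
print(filled)
```

Without an index, a scalar produces a one-item Series, as in item 1.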

#### 8. creating Series from dict

In [None]:
ser_dict = pd.Series({'a':1,
                      'b':2,
                      'c':3,
                      'd':4})
ser_dict

a    1
b    2
c    3
d    4
dtype: int64

#### 9. Creating Series from numpy array

In [None]:
seri_num = pd.Series(np.array([11,33,22,44]))
seri_num

0    11
1    33
2    22
3    44
dtype: int32

#### 10. Checking Size, shape, uniqueness, and counts of values

In [None]:
# example series, which also contains a NaN(not a number)
# numpy nan property is used here to create an NaN 
s = pd.Series([0, 1, 1, 2, 3, 4, 4, 5, 6, 7,np.nan]) 
s

0     0.0
1     1.0
2     1.0
3     2.0
4     3.0
5     4.0
6     4.0
7     5.0
8     6.0
9     7.0
10    NaN
dtype: float64

In [None]:
print(len(s))
print(s.size)
print(s.shape)
print(s.count())
print(s.unique())
print(s.value_counts())

11
11
(11,)
10
[ 0.  1.  2.  3.  4.  5.  6.  7. nan]
4.0    2
1.0    2
7.0    1
6.0    1
5.0    1
3.0    1
2.0    1
0.0    1
dtype: int64


#### 11. Peeking at data with head, tail, and take

In [None]:
# first five values
s.head()

0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
dtype: float64

In [None]:
# last five values
s.tail()

6     4.0
7     5.0
8     6.0
9     7.0
10    NaN
dtype: float64

In [None]:
# first three values
s.head(n = 3)

0    0.0
1    1.0
2    1.0
dtype: float64

In [None]:
# last three values
s.tail(n = 3)

8     6.0
9     7.0
10    NaN
dtype: float64

In [None]:
s.take([2,6,4])

2    1.0
6    4.0
4    3.0
dtype: float64

#### 12. Label lookup (get multiple items from a Series)

In [None]:
ser2_str_index[['c','a']]

c    3
a    1
dtype: int64

#### 13. Position-based lookup (get multiple items from a Series)

In [None]:
# iloc makes the positional lookup explicit on a string-labeled Series
ser2_str_index.iloc[[1,3]]

b    2
d    4
dtype: int64

#### 14. Series with an integer index, but not starting with 0

In [None]:
seri_int = pd.Series([1,2,3,4], index= [3,4,5,6])
seri_int

3    1
4    2
5    3
6    4
dtype: int64

#### 15. label-based lookup versus position-based lookup

In [None]:
seri_int[5]  # 5 is treated as a label here, because the index
       # is made of integers and contains the label 5

3

In [None]:
seri_int[0]   # with an integer index, [] looks up labels, so there is no fallback to position 0

KeyError: 0

#### 16. Using loc and iloc

In [None]:
# loc always performs a label-based lookup
print(seri_int.loc[3])
print(seri_int.loc[4])
print(seri_int.loc[5])
print(seri_int.loc[6])


1
2
3
4


In [None]:
print(seri_int.loc[0])
print(seri_int.loc[1])
print(seri_int.loc[2])
print(seri_int.loc[3])


KeyError: 0

In [None]:
# iloc always performs a position-based lookup, whatever the index labels are
# only positions 0-3 exist, so iloc[3] works and iloc[4] raises IndexError
print(seri_int.iloc[3])
print(seri_int.iloc[4])
print(seri_int.iloc[5])
print(seri_int.iloc[6])


4


IndexError: single positional indexer is out-of-bounds

In [None]:
print(seri_int.iloc[0])
print(seri_int.iloc[1])
print(seri_int.iloc[2])
print(seri_int.iloc[3])


1
2
3
4


#### 17. Alignment via index labels

A fundamental difference between a NumPy ndarray and a pandas Series is the ability of a Series to automatically align data from another Series based on label values before performing an operation.

In [None]:
s2 = pd.Series([1,2,3,4], index=['a','b','c','d'])
s2

a    1
b    2
c    3
d    4
dtype: int64

In [None]:
s3 = pd.Series([4,3,2,1], index=['d','c','b','a'])
s3

d    4
c    3
b    2
a    1
dtype: int64

In [None]:
s2+s3

a    2
b    4
c    6
d    8
dtype: int64

In [None]:
s3+s2

a    2
b    4
c    6
d    8
dtype: int64
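
To make the contrast with NumPy concrete, here is a minimal sketch that repeats the two Series above under the hypothetical names left and right. The Series addition pairs values by label, while adding the underlying ndarrays pairs them by position:

```python
import pandas as pd

# Same data as s2 and s3 above, under hypothetical names
left = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
right = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])

# Series addition matches values by index label
by_label = left + right

# ndarray addition matches values by position and ignores the labels
by_position = left.to_numpy() + right.to_numpy()

print(by_label)
print(by_position)
```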

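#### 18. Adding Series with partially overlapping labels (Series built from dictionaries)

Labels that appear in only one of the two Series have no partner value, so the result for them is NaN.

NaN + number = NaN
(NaN added to a number results in NaN)

number + NaN = NaN
(a number added to NaN results in NaN)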

In [None]:
s4 = pd.Series({'a':1,
                'b':2,
                'c':3,
                'd':4})
s4

a    1
b    2
c    3
d    4
dtype: int64

In [None]:
s5 = pd.Series({'c':1,
                'e':2,
                'd':3,
                'f':4})
s5

c    1
e    2
d    3
f    4
dtype: int64

In [None]:
# NaN results for a, b, e and f: each of these labels exists in only one Series
# demonstrates alignment
s4+s5

a    NaN
b    NaN
c    4.0
d    7.0
e    NaN
f    NaN
dtype: float64
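
When NaN for unmatched labels is not wanted, one option is the add() method with its fill_value parameter, which substitutes a value for the missing side before adding. This is a sketch of that alternative, not something the cells above use; the names first, second, and total are hypothetical:

```python
import pandas as pd

# Same data as s4 and s5 above, under hypothetical names
first = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
second = pd.Series({'c': 1, 'e': 2, 'd': 3, 'f': 4})

# Labels missing from one side are treated as 0 instead of producing NaN
total = first.add(second, fill_value=0)
print(total)
```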

#### 19. Adding Series with duplicate index labels

In [None]:
s6 = pd.Series([1.0,2.0,3.0], index=['a','a','b'])
s6

a    1.0
a    2.0
b    3.0
dtype: float64

In [None]:
s7 = pd.Series([1.0,2.0,3.0], index=['a','a','c'])
s7

a    1.0
a    2.0
c    3.0
dtype: float64

In [None]:
s6+s7

a    2.0
a    3.0
a    3.0
a    4.0
b    NaN
c    NaN
dtype: float64

![download.png](attachment:download.png)
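
The four 'a' rows in the result come from pairing every 'a' on the left with every 'a' on the right (2 x 2 combinations). The sketch below reproduces that with hypothetical names, then shows one possible way to avoid it, assuming that summing the duplicates per label is acceptable for the data at hand:

```python
import pandas as pd

# Same data as s6 and s7 above, under hypothetical names
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
right = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'c'])

result = left + right
a_rows = result.loc['a']          # every left 'a' paired with every right 'a'
print(len(a_rows))

# One option (an assumption about intent): collapse duplicates first
collapsed = left.groupby(level=0).sum() + right.groupby(level=0).sum()
print(collapsed)
```

Checking index.is_unique on each Series before an arithmetic operation is a cheap way to spot this situation early.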

#### 20. Handling NaN (Not-A-Number) with Series

In [None]:
# simple mean of numpy array values
nda = np.array([1, 2, 3, 4, 5])
nda.mean()

3.0

In [None]:
# mean of numpy array values with a NaN
nda = np.array([1, 2, 3, 4, np.nan])  # np.nan is the spelling supported by all NumPy versions
nda.mean()

nan

In [None]:
# But a pandas Series skips NaN values by default
s8 = pd.Series(nda)      
s8.mean()

2.5

#### 21. Making a Series handle NaN values like NumPy (skipna=False)

In [None]:
s8.mean(skipna=False)

nan
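
The two defaults can be compared side by side. This minimal sketch also uses np.nanmean, which is NumPy's NaN-skipping counterpart and is not used elsewhere in this notebook; the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, np.nan])

plain_mean = values.mean()               # NaN propagates in NumPy
nan_aware_mean = np.nanmean(values)      # NumPy's NaN-skipping variant
series_mean = pd.Series(values).mean()   # pandas skips NaN by default

print(plain_mean, nan_aware_mean, series_mean)
```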

#### 22. Working with Boolean Selection

In [None]:
s9 = pd.Series(np.arange(0,10))
s9 > 3

0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
8     True
9     True
dtype: bool

#### 23. select rows where values are > 5

In [None]:
logicalResults = s9 > 5
s9[logicalResults]

6    6
7    7
8    8
9    9
dtype: int32

#### 24. A little shorter version to Select

In [None]:
s9[s9>5]

6    6
7    7
8    8
9    9
dtype: int32

#### 25. Select specific range of rows from series

In [None]:
# commented out because it raises a ValueError:
# Python's "and" needs a single True/False, not a Series of them
# s9[s9 > 5 and s9 < 8]

# correct syntax
s9[(s9 > 5) & (s9 < 8)]

6    6
7    7
dtype: int32
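
As a minimal sketch of the element-wise operators, the block below shows the error raised by Python's and, followed by & (and), | (or), and ~ (not). The names numbers, between, outside, and negated are hypothetical:

```python
import numpy as np
import pandas as pd

numbers = pd.Series(np.arange(0, 10))

error_message = None
try:
    numbers[numbers > 5 and numbers < 8]
except ValueError as err:
    # "and" asks for the truth value of a whole Series, which is ambiguous
    error_message = str(err)

between = numbers[(numbers > 5) & (numbers < 8)]     # element-wise and
outside = numbers[(numbers <= 5) | (numbers >= 8)]   # element-wise or
negated = numbers[~(numbers > 5)]                    # element-wise not

print(error_message)
print(between.tolist(), outside.tolist(), negated.tolist())
```

The parentheses around each comparison are required, because &, |, and ~ bind more tightly than the comparison operators.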

#### 26. Counting True values by summing a boolean array

In [None]:
(np.array([True,False,True,True])).sum()

3

#### 27. Check whether all items satisfy the condition

In [None]:
# are all items >= 0?
(s9 >= 0).all()

True

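#### 28. Check whether any item satisfies the condition

In [None]:
# any items < 2?
# call any() on the boolean mask itself; calling it on the selected values
# would test whether any selected value is non-zero instead
(s9 < 2).any()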

True

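#### 29. Count the values that satisfy the condition

In [None]:
# how many values < 2?
# sum() counts the True entries; count() would return the number of
# non-null entries in the mask (10 here), whatever the condition is
(s9 < 2).sum()

2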
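
count() and sum() answer different questions on a boolean mask, which is easy to mix up. A minimal sketch with hypothetical names:

```python
import numpy as np
import pandas as pd

numbers = pd.Series(np.arange(0, 10))
mask = numbers < 2

non_null_entries = mask.count()   # number of non-null entries in the mask
matching = mask.sum()             # True counts as 1, so this counts matches
share = mask.mean()               # fraction of entries that match

print(non_null_entries, matching, share)
```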

### 30. Reindexing a Series
Reindexing in pandas is a process that makes the data in a Series or DataFrame match a given set of labels. This is core to the functionality of pandas as it enables label alignment across multiple objects, which may originally have different indexing schemes. This process of performing a reindex includes the following steps:

1. Reordering existing data to match a set of labels.
2. Inserting NaN markers where no data exists for a label.
3. Possibly, filling missing data for a label using some type of logic (defaulting to adding NaN values).
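
The three steps can be sketched on a tiny Series. The data and the names original, reordered, with_gap, and filled are made up for this illustration:

```python
import pandas as pd

original = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Step 1: reorder existing data to match the requested labels
reordered = original.reindex(['c', 'a'])

# Step 2: a label with no data ('z') gets NaN, which also turns the dtype into float
with_gap = original.reindex(['a', 'z'])

# Step 3: fill the missing label with a chosen value instead of NaN
filled = original.reindex(['a', 'z'], fill_value=0)

print(reordered, with_gap, filled, sep='\n')
```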

#### 31. Changing indexes of a series 

In [None]:
#Creating Sample Series
s10 = pd.Series(np.random.randn(5))
s10

0    1.505688
1    0.285284
2   -2.395424
3    0.167540
4    1.288886
dtype: float64

In [None]:
# Changing Indexes
s10.index=['a','b','c','d','e']
s10

a    1.505688
b    0.285284
c   -2.395424
d    0.167540
e    1.288886
dtype: float64

#### 32. Concatenating Series
Let's examine a slightly more practical example. The following code concatenates two Series objects, resulting in duplicate index labels, which may not be desired in the resulting Series:

In [None]:
s11 = pd.Series(np.random.randn(3))
s11

0   -1.900792
1    0.167370
2    0.005499
dtype: float64

In [None]:
s12 = pd.Series(np.random.randn(3))
s12

0   -1.825490
1    0.350967
2    1.149903
dtype: float64

In [None]:
comb_Seri = pd.concat([s11,s12])
comb_Seri

0   -1.900792
1    0.167370
2    0.005499
0   -1.825490
1    0.350967
2    1.149903
dtype: float64

#### 33. Reset the index (removes the duplicated labels)
Assigning to the .index property modifies the Series in place.

In [None]:
comb_Seri.index = np.arange(0, len(comb_Seri))
comb_Seri

0   -1.900792
1    0.167370
2    0.005499
3   -1.825490
4    0.350967
5    1.149903
dtype: float64
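
Two other ways to get a fresh 0..n-1 index, shown here as a sketch rather than as the approach used above: pd.concat with ignore_index=True, and reset_index(drop=True). Unlike assigning to .index, both return a new Series. The names first, second, combined, and renumbered are hypothetical:

```python
import pandas as pd

first = pd.Series([1.0, 2.0, 3.0])
second = pd.Series([4.0, 5.0, 6.0])

# Discard the original labels while concatenating
combined = pd.concat([first, second], ignore_index=True)

# Or renumber afterwards; drop=True discards the old labels
renumbered = pd.concat([first, second]).reset_index(drop=True)

print(combined)
```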

#### 34. Using reindex() method to reindex
Greater flexibility in creating a new index is provided using the .reindex() method. An example of the flexibility of .reindex() over assigning the .index property directly is that the list provided to .reindex() can be of a different length than the number of rows in the Series:

In [None]:
s13 = pd.Series(np.random.randn(3),['a','b','d'])
print(s13)

s14 = s13.reindex(['a','b','c'])
print(s14)

a   -1.135632
b    1.212112
d   -0.173215
dtype: float64
a   -1.135632
b    1.212112
c         NaN
dtype: float64


Things to note:

1. reindex() does not reindex in place; it returns a new Series and leaves the original unmodified.
2. Any new label that is not in the original index is assigned NaN.
3. Any original label that is left out of the new index is dropped from the new Series.
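
These three points can be checked directly. A minimal sketch with made-up values and the hypothetical names source and result:

```python
import numpy as np
import pandas as pd

source = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'd'])
result = source.reindex(['a', 'b', 'c'])

print(list(source.index))      # the original keeps its own labels
print(np.isnan(result['c']))   # the new label 'c' holds NaN
print('d' in result.index)     # the omitted label 'd' is gone from the result
```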

#### 35. Adding Series with string and integer labels

In [None]:
s15 = pd.Series([0, 1, 2], index=[0, 1, 2])
s16 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s15 + s16

0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

In [None]:
# reindex by casting the label types and we will get the desired result

s16.index = s16.index.values.astype(int)
s15 + s16

0    3
1    5
2    7
dtype: int64

#### 36. Fill with a value instead of NaN
The default action of inserting NaN for a missing label during reindexing can be overridden with the fill_value parameter of reindex().

In [None]:
s17_fill_val = s15.reindex([1,3], fill_value=0)
s17_fill_val

1    1
3    0
dtype: int64

#### 37. Forward fill, backward fill, and nearest
The method parameter of reindex() accepts ffill, bfill, and nearest.

In [None]:
s18 = pd.Series(['apple','Mango','Banana'], index=[0,4,7])
print(s18)

0     apple
4     Mango
7    Banana
dtype: object


In [None]:
# Forward Fill
s18.reindex(np.arange(0,10),method='ffill')

0     apple
1     apple
2     apple
3     apple
4     Mango
5     Mango
6     Mango
7    Banana
8    Banana
9    Banana
dtype: object

In [None]:
# backward Fill
s18.reindex(np.arange(0,10),method='bfill')

0     apple
1     Mango
2     Mango
3     Mango
4     Mango
5    Banana
6    Banana
7    Banana
8       NaN
9       NaN
dtype: object

In [None]:
# nearest Fill
s18.reindex(np.arange(0,10),method='nearest')

0     apple
1     apple
2     Mango
3     Mango
4     Mango
5     Mango
6    Banana
7    Banana
8    Banana
9    Banana
dtype: object

### 38. Slicing a Series

In [None]:
# a Series to use for slicing
# using index labels not starting at 0 to demonstrate
# position based slicing
s19 = pd.Series(np.arange(50,61), index=np.arange(10,21))
print('original series')
print(s19)
print('sliced series')
print(s19[2:6:1])

original series
10    50
11    51
12    52
13    53
14    54
15    55
16    56
17    57
18    58
19    59
20    60
dtype: int32
sliced series
12    52
13    53
14    54
15    55
dtype: int32


#### 39. first five by slicing, same as .head(5)

In [None]:
s19[:5]

10    50
11    51
12    52
13    53
14    54
dtype: int32
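
Because the index of this Series is made of integers, it helps to say explicitly which kind of slice is meant. In this sketch (hypothetical names), iloc slices by position and excludes the stop, while loc slices by label and includes it:

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(50, 61), index=np.arange(10, 21))

by_position = series.iloc[2:6]   # positions 2, 3, 4, 5 (stop excluded)
by_label = series.loc[12:15]     # labels 12 through 15 (stop included)

print(by_position)
print(by_label)
```

Both slices select the same four rows here, which makes the difference in end-point handling easy to see.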

#### 40. Missing Data in Series
NaN values mark data that is missing from the Series.

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

s20 = pd.Series(sdata, index=states)
s20

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

#### 41. Check if series has null values

In [None]:
pd.isnull(s20)  # equivalent to s20.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

#### 42. Check which values are not null (True where a value is present)

In [None]:
pd.notnull(s20)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
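
Once missing values are located, two common follow-ups are dropping them or replacing them. The sketch below uses dropna() and fillna() on data shaped like s20; the name population and the choice of 0 as the replacement are assumptions made for illustration:

```python
import pandas as pd

data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
population = pd.Series(data, index=['California', 'Ohio', 'Oregon', 'Texas'])

missing_count = population.isna().sum()   # number of NaN entries
without_missing = population.dropna()     # new Series without the NaN rows
zero_filled = population.fillna(0)        # new Series with NaN replaced by 0

print(missing_count)
print(without_missing)
print(zero_filled)
```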

## 43. The pandas DataFrame Object
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
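
As a small sketch of the "different value type per column" point, the frame below mixes integers, strings, and booleans. The column names and values are made up for this illustration:

```python
import pandas as pd

# Hypothetical columns, one per value type
frame = pd.DataFrame({
    'count': [1, 2, 3],
    'name': ['x', 'y', 'z'],
    'flag': [True, False, True],
})

# Each column keeps its own dtype, and each column is itself a Series
print(frame.dtypes)
print(type(frame['name']))
```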

#### 44. Creating a DataFrame from scratch

In [None]:
# create a DataFrame from a 2-d ndarray
# default row and columns indexes
df = pd.DataFrame(np.array([[1,2,3,4,5],[2,4,5,7,8]]))
df

   0  1  2  3  4
0  1  2  3  4  5
1  2  4  5  7  8


#### 45. Create a DataFrame for a list of Series objects

In [None]:
df2 = pd.DataFrame([pd.Series(np.arange(10,15)),pd.Series(np.arange(20,25))])
df2

    0   1   2   3   4
0  10  11  12  13  14
1  20  21  22  23  24


#### 46. Create a DataFrame from two Series objects in a dictionary, assigning different column names

In [None]:
s21 = pd.Series(np.arange(0,5))
s22 = pd.Series(np.arange(5,10))
df3 = pd.DataFrame({'item1':s21,'item2':s22})
df3

   item1  item2
0      0      5
1      1      6
2      2      7
3      3      8
4      4      9


#### 47. Create a DataFrame with named columns and rows

In [None]:
df4 = pd.DataFrame(np.array([[1,2,3,4,5],[6,7,8,9,1]]),index=['orange','Mango'],columns=['Mon','Tue','Wed','Thu','Fri'])
df4

        Mon  Tue  Wed  Thu  Fri
orange    1    2    3    4    5
Mango     6    7    8    9    1


#### 48. Demonstrate alignment during creation

In [None]:
s23 = pd.Series(np.arange(15,17),index=[0,2])
df5 = pd.DataFrame({'a1':s21, 'a2':s22, 'a3':s23})
df5

   a1  a2    a3
0   0   5  15.0
1   1   6   NaN
2   2   7  16.0
3   3   8   NaN
4   4   9   NaN


#### 49. Creating data frames(using dictionary)

In [None]:
dataf = {'Mango':[3,5,6,8,4,7,7],
         'Orange':[9,8,6,9,9,7,7],
         'Packet':[8,7,9,9,8,7,6]}
df6 = pd.DataFrame(dataf)
df6

   Mango  Orange  Packet
0      3       9       8
1      5       8       7
2      6       6       9
3      8       9       9
4      4       9       8
5      7       7       7
6      7       7       6


#### 50. Accessing Each Column

In [None]:
df6.Mango

0    3
1    5
2    6
3    8
4    4
5    7
6    7
Name: Mango, dtype: int64

In [None]:
df6['Mango']

0    3
1    5
2    6
3    8
4    4
5    7
6    7
Name: Mango, dtype: int64

#### 51. Adding new column to old dataframe

In [None]:
df7 = pd.DataFrame(dataf,columns=['Mango','Orange','Packet','Dept'], index=['one','two','three','four','five','six','seven'],)
df7

       Mango  Orange  Packet  Dept
one        3       9       8   NaN
two        5       8       7   NaN
three      6       6       9   NaN
four       8       9       9   NaN
five       4       9       8   NaN
six        7       7       7   NaN
seven      7       7       6   NaN


#### 52. Assigning value to new column

In [None]:
df7.Dept = 30
df7

       Mango  Orange  Packet  Dept
one        3       9       8    30
two        5       8       7    30
three      6       6       9    30
four       8       9       9    30
five       4       9       8    30
six        7       7       7    30
seven      7       7       6    30


#### 53. Assigning value to dataframe using Series

In [None]:
df7.Dept = pd.Series([1.2,-3.2,-5.6],index=['two','three','four'])
df7

       Mango  Orange  Packet  Dept
one        3       9       8   NaN
two        5       8       7   1.2
three      6       6       9  -3.2
four       8       9       9  -5.6
five       4       9       8   NaN
six        7       7       7   NaN
seven      7       7       6   NaN


#### 54. Adding new column with boolean value

In [None]:
df7['Estimate'] = df7.Orange == 9
df7

       Mango  Orange  Packet  Dept  Estimate
one        3       9       8   NaN      True
two        5       8       7   1.2     False
three      6       6       9  -3.2     False
four       8       9       9  -5.6      True
five       4       9       8   NaN      True
six        7       7       7   NaN     False
seven      7       7       6   NaN     False


#### 55. Deleting Column

In [None]:
del df7['Estimate']
df7

       Mango  Orange  Packet  Dept
one        3       9       8   NaN
two        5       8       7   1.2
three      6       6       9  -3.2
four       8       9       9  -5.6
five       4       9       8   NaN
six        7       7       7   NaN
seven      7       7       6   NaN
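
del removes the column from the DataFrame itself. When the original should stay intact, drop(columns=...) is a possible alternative that returns a new DataFrame instead. A minimal sketch on a small made-up frame (the names frame and trimmed are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame({'Mango': [3, 5],
                      'Orange': [9, 8],
                      'Estimate': [True, False]})

# drop() returns a new DataFrame; frame itself keeps the column
trimmed = frame.drop(columns=['Estimate'])

print(list(frame.columns))
print(list(trimmed.columns))
```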


#### 56. Adding data in the form of nested dictionary 

In [None]:
dictodic = {'Apple':{2010:566,2012:600},
            'Banana':{2010:466,2012:700}}
df8 = pd.DataFrame(dictodic)
df8

      Apple  Banana
2010    566     466
2012    600     700


#### 57. Swapping rows and columns (taking the transpose)

In [None]:
df8.T

        2010  2012
Apple    566   600
Banana   466   700


# Perform Aggregation in Dataframe

## Summary Statistics

### Find a summary of a single column

In [None]:
def action_function(column):
  # return column.min()
  # return column.max()
  return column.quantile(0.3)
df['column_name'].agg(action_function)

### Apply the same summary to multiple columns

In [None]:
def action_function(column):
  # return column.min()
  # return column.max()
  return column.quantile(0.3)
df[['column1_name', 'column2_name']].agg(action_function)

### Apply multiple summaries to a single column

In [None]:
def action_function1(column):
  # return column.min()
  # return column.max()
  return column.quantile(0.3)
def action_function2(column):
  # return column.min()
  # return column.max()
  return column.quantile(0.3)
# pass the functions as one list; as separate arguments, the second would be read as the axis
df['column1_name'].agg([action_function1, action_function2])
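
The cells in this section use placeholder column names, so they do not run against the DataFrames built earlier. Below is a runnable sketch of the same three patterns on a small frame modeled on df6. The names fruit, pct30, and pct70 are hypothetical, and the second function uses a different quantile so that the two results can be told apart:

```python
import pandas as pd

fruit = pd.DataFrame({'Mango': [3, 5, 6, 8, 4, 7, 7],
                      'Orange': [9, 8, 6, 9, 9, 7, 7]})

def pct30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

def pct70(column):
    # 70th percentile of the column
    return column.quantile(0.7)

single = fruit['Mango'].agg(pct30)                   # one summary, one column
per_column = fruit[['Mango', 'Orange']].agg(pct30)   # one summary, several columns
several = fruit['Mango'].agg([pct30, pct70])         # several summaries, one column

print(single)
print(per_column)
print(several)
```

With a list of functions, the result is a Series labeled by the function names, which is why giving each function a distinct name matters.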