
**Pandas:** Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous, i.e. mixed data types) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

  Pandas is designed to make it easier to work with structured data. Most of the analysis you perform will likely involve tabular data, e.g. from .csv files or relational (SQL) databases. The DataFrame object in pandas is "a two-dimensional tabular, column-oriented data structure with row and column labels."

**Here are just a few things that pandas does well:**

1.  Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for
    saving/loading data in the ultrafast HDF5 format.

2.  Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.

3.  Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.

4.  Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply
    ignore the labels and let Series, DataFrame, etc. automatically align the data in computations.

5.  Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both
    aggregating and transforming data.

6.  Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame
    objects.

7.  Intelligent label-based slicing, fancy indexing and sub-setting of large data sets

8.  Intuitive merging and joining of data sets.

9.  Flexible reshaping and pivoting of data sets.

10. Hierarchical labeling of axes (possible to have multiple labels per tick)

11. Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving
    window linear regressions, date shifting and lagging, etc.
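
Two of the features listed above, automatic alignment (with NaN for missing data) and split-apply-combine, can be sketched in a few lines. This is only a minimal illustration; the Series `s1`, `s2` and the `scores` DataFrame are made-up data, not part of the original notes.

```python
import numpy as np
import pandas as pd

# Automatic alignment: values are matched by index label, not by position.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
total = s1 + s2  # labels present in only one Series produce NaN
print(total)

# Split-apply-combine: split rows by 'team', apply sum, combine the results.
scores = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                       'points': [5, 7, 3, 1]})
per_team = scores.groupby('team')['points'].sum()
print(per_team)
```

Labels `a` and `d` exist in only one of the two Series, so their sums are NaN, while `b` and `c` are added label by label.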

In [1]:
import pandas as pd
import numpy as np

print(pd.__version__)

# First Series (default integer index)
num_ser = pd.Series([2, 3, 4, 5, 6, 7, 8])
print(num_ser)

# A Series with custom index labels and a name
num_ser = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], name='Number_Series')
print(num_ser)
print(num_ser.iloc[0])  # access by position
print(num_ser['a'])     # access by label

1.5.3
0    2
1    3
2    4
3    5
4    6
5    7
6    8
dtype: int64
a    1
b    2
c    3
d    4
e    5
f    6
g    7
h    8
Name: Number_Series, dtype: int64
1
1
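
When a Series has a non-integer index, the same element can be reached either by position or by label. The sketch below shows the two explicit accessors, `.iloc` and `.loc`; the Series `ser` is a shortened, illustrative stand-in for the one above. Recent pandas releases warn when a plain integer key is used positionally with `[]` on such a Series, so the explicit accessors are the safer habit.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='Number_Series')

first_by_position = ser.iloc[0]   # position-based access
first_by_label = ser.loc['a']     # label-based access
print(first_by_position, first_by_label)

# Slicing differs too: .iloc excludes the stop position, .loc includes the stop label.
print(ser.iloc[0:2])
print(ser.loc['a':'b'])
```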


In [2]:
import pandas as pd
import numpy as np

# Creating a DataFrame from a 2D NumPy array
array_2D = np.random.randint(20,100,(5,4))
df = pd.DataFrame(array_2D,columns=('Aman','Sanjay','Ranu','Shushil'),index=('Maths','English','Hindi','Science','Sst'))
print(df)


         Aman  Sanjay  Ranu  Shushil
Maths      77      57    33       69
English    73      91    31       65
Hindi      62      42    91       34
Science    60      35    81       67
Sst        81      76    83       39
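
Once a DataFrame has row and column labels, values can be looked up by those labels. A minimal sketch, assuming a small fixed array (`marks` is illustrative data chosen so the result is repeatable, unlike the random array above):

```python
import numpy as np
import pandas as pd

marks = np.array([[77, 57], [73, 91], [62, 42]])  # illustrative fixed values
df = pd.DataFrame(marks, columns=['Aman', 'Sanjay'], index=['Maths', 'English', 'Hindi'])

print(df['Aman'])                   # one column, returned as a Series
print(df.loc['Maths'])              # one row, selected by its label
print(df.loc['English', 'Sanjay'])  # a single cell
```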


In [3]:
import pandas as pd
import numpy as np

#Using a conventional list of lists
lst = [['Aman',16],['Sanjay',22],['Ranu',26],['Shushil', 32]]

df = pd.DataFrame(lst,columns=['Name','Age'])
print(df)


      Name  Age
0     Aman   16
1   Sanjay   22
2     Ranu   26
3  Shushil   32


In [4]:
import pandas as pd
import numpy as np

#Using a Dictionary
emp_dic= {'Employee Name': ['Ravi','Ashish','Mohan','Akash'],'Salary':[18000,25000,22000,32000]}
df = pd.DataFrame(emp_dic)
print(df)


  Employee Name  Salary
0          Ravi   18000
1        Ashish   25000
2         Mohan   22000
3         Akash   32000


In [5]:
import pandas as pd
import numpy as np

# Using a dictionary of Series
series_dict = {'First Series': pd.Series([10,20,30,40,50]),
               'Second Series': pd.Series([60,70,80,90,100])}
df = pd.DataFrame(series_dict)
print(df)


   First Series  Second Series
0            10             60
1            20             70
2            30             80
3            40             90
4            50            100
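
The Series in a dictionary do not need to share the same index. As a rough illustration (the labels and values here are made up), pandas builds the row index from the union of the Series indexes and fills the gaps with NaN:

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative data).
series_dict = {'First Series': pd.Series([10, 20, 30], index=['a', 'b', 'c']),
               'Second Series': pd.Series([60, 70], index=['b', 'c'])}
df = pd.DataFrame(series_dict)
print(df)
```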


In [6]:
import pandas as pd
import numpy as np

# Using a list of Dictionaries
lst_dic = [{'a':10, 'b':20,'c':30,'d':40},{'a':50, 'b':60,'c':70,'d':80}]
df=pd.DataFrame(lst_dic)
print(df)


lst_dic2 = [{'a':10, 'b':20,'d':40},{'a':50,'c':70,'d':80}]
df=pd.DataFrame(lst_dic2)
print(df)

    a   b   c   d
0  10  20  30  40
1  50  60  70  80
    a     b   d     c
0  10  20.0  40   NaN
1  50   NaN  80  70.0
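
In the second output above, columns `b` and `c` show `20.0` and `70.0` rather than integers. NaN is a floating point value, so a column containing it is stored as `float64`. The sketch below checks this and shows one possible way to handle the gaps; filling with 0 is an assumption made for illustration, not a general recommendation.

```python
import pandas as pd

records = [{'a': 10, 'b': 20, 'd': 40}, {'a': 50, 'c': 70, 'd': 80}]
df = pd.DataFrame(records)
print(df.dtypes)          # 'b' and 'c' become float64 because NaN is a float
print(df.isna().sum())    # count of missing values per column

filled = df.fillna(0).astype(int)  # one option: replace NaN with 0 and restore ints
print(filled)
```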


In [8]:
import pandas as pd
import numpy as np

# Example 1: a list holding one dictionary built with zip()
lst1 = ['A','B','C','D']
lst2  = [10,20,30,40]
L1 = []
L1.append(dict(zip(lst1,lst2)))

df= pd.DataFrame(L1)
print(df)

# Example 2: a list of (name, age) tuples built with zip()
character = ['Ritik','Roshan','Hitesh','Roshini','Mansi']
age = [32,42,24,30,28]

lst_of_tuple = list(zip(character,age))
df = pd.DataFrame(lst_of_tuple,columns=['Name','Age'])
print(df)

# Example 3: a 25x5 NumPy array of standard-normal random samples
np_array = np.random.randn(25,5)
df1 = pd.DataFrame(np_array,columns=['a','b','c','d','e'])
print(df1)



    A   B   C   D
0  10  20  30  40
      Name  Age
0    Ritik   32
1   Roshan   42
2   Hitesh   24
3  Roshini   30
4    Mansi   28
           a         b         c         d         e
0  -0.369615 -1.033439 -0.129663  0.505958  1.070827
1  -0.451269 -0.402573 -0.556781 -0.215117  0.518730
2   0.713208  0.082397 -0.599600 -0.271221 -0.362115
3  -0.839130  2.720827  0.957723  0.424684  1.781605
4  -0.091315  0.096600 -0.715287 -0.972950  0.688881
5  -0.847928 -0.652439  0.111612  0.871883 -0.123986
6   0.017329 -2.073102  0.177729 -0.972991  0.001321
7   0.825860  1.273057 -0.290330  0.328854 -0.018775
8   0.670076  0.244511 -0.566769  0.334570 -0.433303
9   0.797671  0.133677  1.009347  0.539309  0.318399
10  0.349503  0.293460 -1.556179  0.674220 -2.107672
11 -0.125539 -0.244252  1.157095  0.631141  1.399654
12  1.066374  0.316044 -0.244142  1.086587  1.169632
13 -1.496839  0.888386  1.124468 -0.341764 -0.321717
14  2.719522 -0.005368 -2.394900 -0.095368  0.732616
15 -0.852156 -2.3768

In [9]:
import pandas as pd
import numpy as np

np_array = np.random.randn(25,5)
df1 = pd.DataFrame(np_array,columns=['a','b','c','d','e'])

# DataFrame.head(n): returns the first n rows with all columns (n defaults to 5).
print(df1.head(10))


          a         b         c         d         e
0  0.181071 -0.912534 -0.949940  0.372621  0.650025
1 -1.130835 -0.557386 -0.041261  1.127203  0.493054
2 -0.603677 -1.013088 -0.038924  0.851229 -0.121431
3  0.698317  0.898630 -0.503683  1.702410  1.544159
4  1.153761  1.402523 -0.716280 -0.455350 -0.530291
5 -0.065764 -0.181799  1.435296 -0.336526 -0.337215
6  1.463793  1.405534  0.117907  0.019636 -2.036101
7  1.225783  1.083134 -0.664777  0.205402  0.324807
8 -1.335707  0.790671 -0.871352  0.356834  1.437900
9 -0.902407 -1.398832  0.180494  1.598242 -1.648961


In [10]:
import pandas as pd
import numpy as np

np_array = np.random.randn(25,5)
df1 = pd.DataFrame(np_array,columns=['a','b','c','d','e'])

# DataFrame.tail(n): returns the last n rows with all columns (n defaults to 5).
print(df1.tail(8))


           a         b         c         d         e
17  1.550026 -0.036509 -0.447714 -0.335391 -1.023062
18 -0.600780  0.535251  0.742426  0.474444 -1.042312
19  0.876784  1.544068 -0.845623  0.009895 -0.507110
20 -0.477300 -0.943388  1.560990 -0.100292 -1.156817
21 -0.705674  2.156210 -0.473978  0.022073  1.642048
22 -0.553205  0.640205 -0.886992 -0.159623  0.757007
23 -0.931693 -1.500833 -0.122490 -0.782013  0.537215
24  0.459485  0.686594  0.106624 -0.259798  1.590036
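
Both `head()` and `tail()` can be called without an argument, in which case they return 5 rows. A negative `n` is also accepted and means "all rows except that many". A short sketch on a DataFrame of the same shape as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(25, 5), columns=['a', 'b', 'c', 'd', 'e'])

print(df1.head().shape)     # no argument: first 5 rows
print(df1.tail().shape)     # no argument: last 5 rows
print(df1.head(-20).shape)  # negative n: every row except the last 20
```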


In [11]:
import pandas as pd
import numpy as np

np_array = np.random.randn(25,5)
df1 = pd.DataFrame(np_array,columns=['a','b','c','d','e'])

# DataFrame.columns: the column labels of a DataFrame
print(df1.columns)

# DataFrame.info(): prints a concise summary of the index, columns, dtypes, and memory usage
print(df1.info())


Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       25 non-null     float64
 1   b       25 non-null     float64
 2   c       25 non-null     float64
 3   d       25 non-null     float64
 4   e       25 non-null     float64
dtypes: float64(5)
memory usage: 1.1 KB
None
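
`DataFrame.info()` writes its summary to standard output and returns `None`, so wrapping the call in `print()` adds an extra `None` line after the summary. The sketch below makes that explicit and shows two attributes, `shape` and `dtypes`, that return values you can use in code:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(25, 5), columns=['a', 'b', 'c', 'd', 'e'])

result = df1.info()    # prints the summary itself
print(result is None)  # info() has no return value

print(df1.shape)       # (rows, columns)
print(df1.dtypes)      # dtype of each column
```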


In [12]:
import pandas as pd
import numpy as np

#Renaming Columns
Name = ['Ritik','Roshan','Hitesh','Roshini','Mansi']
age = [32,42,24,30,28]

lst_of_tuple = list(zip(Name,age))
df = pd.DataFrame(lst_of_tuple,columns=['Name','Age'])
print(df)
df.rename(columns={'Name': 'Actor Name', 'Age': 'Actor Age'},inplace=True)
print(df)

      Name  Age
0    Ritik   32
1   Roshan   42
2   Hitesh   24
3  Roshini   30
4    Mansi   28
  Actor Name  Actor Age
0      Ritik         32
1     Roshan         42
2     Hitesh         24
3    Roshini         30
4      Mansi         28
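
`rename()` can also be used without modifying the original DataFrame. A minimal sketch with a two-row version of the data above: when `inplace=True` is omitted, `rename()` returns a new DataFrame and the original keeps its column names.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ritik', 'Roshan'], 'Age': [32, 42]})

# Without inplace=True, rename() returns a new DataFrame and leaves df unchanged.
renamed = df.rename(columns={'Name': 'Actor Name', 'Age': 'Actor Age'})
print(list(df.columns))
print(list(renamed.columns))
```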


In [None]:
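# Read a CSV file into a DataFrame ('file' is a placeholder path)
#df = pd.read_csv('file')
# Write a DataFrame to a CSV file ('file' is a placeholder path)
#df.to_csv('file')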
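
The two calls can be tried without a file on disk by writing to an in-memory text buffer. This is a sketch under that assumption: `io.StringIO` stands in for the file path, and the small DataFrame is illustrative data.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Ritik', 'Roshan'], 'Age': [32, 42]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row index out of the CSV
buffer.seek(0)                  # rewind so the buffer can be read from the start

loaded = pd.read_csv(buffer)
print(loaded)
```

Without `index=False`, the row index is written as an extra unnamed first column, which then shows up as a column like `Unnamed: 0` when the file is read back.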