
# Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Here are a few of the things that Pandas does well:

*   Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
*   Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.



In [1]:
#Import package
import pandas as pd
import numpy as np

#Version of pandas
pd.__version__

'2.2.2'

There are two central concepts which you should know about:

* Series: A Series is a one-dimensional labeled array; it is essentially a single column of data together with its index.

* DataFrame: A DataFrame is a two-dimensional data structure in which data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
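
The relationship between the two concepts can be seen in a minimal sketch: selecting one column of a DataFrame gives back a Series. The names `toy_df` and `score_column` and the data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({"name": ["x", "y", "z"], "score": [1, 2, 3]})

# Selecting a single column returns a Series
score_column = toy_df["score"]

print(type(toy_df).__name__)        # DataFrame
print(type(score_column).__name__)  # Series
print(toy_df.shape)                 # (3, 2): rows, columns
print(score_column.shape)           # (3,): one-dimensional
```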

In [None]:
#Our first series
number_series= pd.Series([2,3,5,6,8])
print(number_series)

0    2
1    3
2    5
3    6
4    8
dtype: int64


In [None]:
#Same series but with different indices
number_series= pd.Series([2,3,5,6,8], index=['a','b','d','e','f'], name="Number")
print(number_series)

a    2
b    3
d    5
e    6
f    8
Name: Number, dtype: int64


In [None]:
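#Access by position with .iloc and by label with []
#(number_series[0] on a string index is deprecated in recent pandas versions)
print(number_series.iloc[0])
print(number_series['a'])

2
2
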
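
Label-based and position-based access are easiest to keep apart with the explicit `.loc` and `.iloc` accessors. The following is a minimal sketch that rebuilds the series from above; the variable names for the results are illustrative.

```python
import pandas as pd

number_series = pd.Series([2, 3, 5, 6, 8],
                          index=['a', 'b', 'd', 'e', 'f'], name="Number")

first_by_position = number_series.iloc[0]   # element at position 0
first_by_label = number_series.loc['a']     # element with label 'a'
label_slice = number_series.loc['b':'e']    # label slices include both ends
position_slice = number_series.iloc[1:3]    # position slices exclude the end

print(first_by_position, first_by_label)
print(list(label_slice))
print(list(position_slice))
```

Note that the label slice returns three elements (`b`, `d`, `e`) while the position slice returns two, because only label slices include the end point.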


# Let us create our first Pandas DataFrame

Using a 2-D NumPy array

In [None]:
#Creating our first Dataframe
some_2D_array=np.random.randint(4,6,(3,4))
first_dataframe= pd.DataFrame(some_2D_array, columns=['A','B','C','D'])
first_dataframe

   A  B  C  D
0  5  4  5  5
1  4  5  5  5
2  4  5  5  5


In [None]:
#Let's create the same dataframe, but with different indices
another_dataframe=pd.DataFrame(some_2D_array, columns=['A','B','C','D'], index=['a','b','c'])
another_dataframe

   A  B  C  D
a  5  4  5  5
b  4  5  5  5
c  4  5  5  5


Using a conventional list of lists

In [None]:
#Initialize list of lists
list_of_lists=[['Amar',10],['Akbar',15],['Anthony',14]]
#Create the Pandas DataFrame
df=pd.DataFrame(list_of_lists, columns=['Name','Age'])
df

      Name  Age
0     Amar   10
1    Akbar   15
2  Anthony   14


Using a dictionary

In [13]:
#Store the data in a dictionary
employee_dict= {'Employee Name': ['Rajeev', 'Sumit', "Aviral"], 'Income':[200000, 140000, 90000]}

#Create DataFrame
employee_df=pd.DataFrame(employee_dict)

#Print the dataframe
employee_df

  Employee Name  Income
0        Rajeev  200000
1         Sumit  140000
2        Aviral   90000


In [12]:
#Store the data in a dictionary
series_dict= {'First Series': pd.Series([10,20,30,40]),
              'Second Series': pd.Series([10,20,30,40])}

#Create DataFrame
series_df=pd.DataFrame(series_dict)

#Print the dataframe
series_df

   First Series  Second Series
0            10             10
1            20             20
2            30             30
3            40             40
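
When the Series in the dictionary do not share the same index, the DataFrame constructor aligns them on their labels and fills the gaps with NaN. A minimal sketch of this behaviour, using two hypothetical series (`left` and `right`) that are not part of the notebook:

```python
import pandas as pd

# Hypothetical series with partly overlapping labels
left = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
right = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Rows are matched by label, not by position
aligned_df = pd.DataFrame({'left': left, 'right': right})
print(aligned_df)

# Labels present in only one series get NaN in the other column
print(aligned_df.isna().sum())
```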


Using a list of dictionaries

In [11]:
#Initialise the list of dictionaries
list_of_dicts=[{'a':1, 'b':2, 'c':3},
               {'a':10, 'b':20, 'c':30}]

#Create the DataFrame
df=pd.DataFrame(list_of_dicts)

#Print the dataframe
df

    a   b   c
0   1   2   3
1  10  20  30


In [None]:
#What happens if the dictionaries have different keys? Missing entries become NaN
list_of_dicts=[{'a': 2, 'c': 3}, {'a':10,'b':20,'c':30}]
df=pd.DataFrame(list_of_dicts, index=['first','second'])
df

         a   c     b
first    2   3   NaN
second  10  30  20.0
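
The missing key has two visible effects: the gap is stored as NaN, and the affected column becomes float64 because NaN is a floating-point value. The sketch below checks both and shows one possible way to fill the gap; the names `df_missing` and `df_filled` and the fill value 0 are illustrative assumptions.

```python
import pandas as pd

list_of_dicts = [{'a': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]
df_missing = pd.DataFrame(list_of_dicts, index=['first', 'second'])

print(df_missing.dtypes)        # column b is float64 because it holds NaN
print(df_missing['b'].isna())   # True for 'first', False for 'second'

# One possible way to fill the gap; 0 is an arbitrary placeholder value
df_filled = df_missing.fillna({'b': 0})
print(df_filled)
```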


There are other ways too!

In [8]:
#List of names
actor_names=['Ryan Reynolds','Benedict Cumberbatch','Robert Downey Jr.','Chris Evans']
#List of ages
actor_ages= [48,62,54,np.nan]
#Get the list of tuples by zipping the two lists together
list_of_tuples= list(zip(actor_names, actor_ages))
#Converting the list of tuples into pandas Dataframe
actor_df= pd.DataFrame(list_of_tuples, columns= ['Name', 'Age'])
#Print the dataframe
actor_df

                   Name   Age
0         Ryan Reynolds  48.0
1  Benedict Cumberbatch  62.0
2     Robert Downey Jr.  54.0
3           Chris Evans   NaN


We can print the indices of our dataframe using the index attribute

In [None]:
list(actor_df.index)

[0, 1, 2, 3]

Let us create a dataframe with 3 rows and 5 columns

In [2]:
#Create an array of shape (3,5)
our_array= np.random.randn(3,5)
df=pd.DataFrame(our_array, columns=['A','B','C','D','E'])
df

          A         B         C         D         E
0 -0.004999  0.018179 -2.061430  1.477727  0.172841
1  1.284931  0.065219 -0.333565 -0.779633  0.357340
2 -0.563419 -0.077040 -1.149384  1.787747 -0.729006


head()


Returns the first n rows of the dataframe (five by default). Our dataframe has only three rows, so all of them are shown.

In [3]:
df.head()

          A         B         C         D         E
0 -0.004999  0.018179 -2.061430  1.477727  0.172841
1  1.284931  0.065219 -0.333565 -0.779633  0.357340
2 -0.563419 -0.077040 -1.149384  1.787747 -0.729006


In [4]:
#Printing only the first row
df.head(1)

          A         B        C         D         E
0 -0.004999  0.018179 -2.06143  1.477727  0.172841


tail()


Returns the last n rows of the dataframe (five by default). Our dataframe has only three rows, so all of them are shown.

In [5]:
df.tail()

          A         B         C         D         E
0 -0.004999  0.018179 -2.061430  1.477727  0.172841
1  1.284931  0.065219 -0.333565 -0.779633  0.357340
2 -0.563419 -0.077040 -1.149384  1.787747 -0.729006


In [6]:
df.tail(1)

          A        B         C         D         E
2 -0.563419 -0.07704 -1.149384  1.787747 -0.729006
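
The defaults of head() and tail() are easier to see on a larger dataframe. The following minimal sketch builds a 100-row frame; the name `big_df`, the seed, and the use of NumPy's `default_rng` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# A larger frame: 100 rows, 5 columns (seeded so the values are repeatable)
rng = np.random.default_rng(0)
big_df = pd.DataFrame(rng.standard_normal((100, 5)), columns=list('ABCDE'))

print(big_df.head().shape)            # (5, 5): the default n is 5
print(big_df.head(10).shape)          # (10, 5)
print(big_df.tail(3).index.tolist())  # [97, 98, 99]
print(big_df.head(-98).shape)         # (2, 5): negative n drops the last 98 rows
```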


columns


Holds the column labels of a dataframe; wrapping it in list() gives a plain Python list

In [7]:
list(df.columns)

['A', 'B', 'C', 'D', 'E']

In [9]:
list(actor_df.columns)

['Name', 'Age']

info()


Prints a concise summary of the dataframe: the index range, each column's name, non-null count and dtype, and the memory usage

In [10]:
actor_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object 
 1   Age     3 non-null      float64
dtypes: float64(1), object(1)
memory usage: 196.0+ bytes
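
The figures reported by info() can also be obtained individually, which is convenient when they are needed as values and not as printed text. A minimal sketch, assuming the same actor data as above (built here from a dictionary so the block is self-contained):

```python
import numpy as np
import pandas as pd

actor_df = pd.DataFrame({
    'Name': ['Ryan Reynolds', 'Benedict Cumberbatch',
             'Robert Downey Jr.', 'Chris Evans'],
    'Age': [48, 62, 54, np.nan],
})

print(actor_df.shape)          # (4, 2): rows, columns
print(actor_df.dtypes)         # dtype of each column
print(actor_df.count())        # non-null values per column
print(actor_df.isna().sum())   # missing values per column
```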


rename()


Rename the columns of the dataframe

In [14]:
actor_df

                   Name   Age
0         Ryan Reynolds  48.0
1  Benedict Cumberbatch  62.0
2     Robert Downey Jr.  54.0
3           Chris Evans   NaN


In [15]:
df1=actor_df.rename(columns={'Name': 'Actor Name', 'Age': 'Actor Age'}, inplace=False)
df1

             Actor Name  Actor Age
0         Ryan Reynolds       48.0
1  Benedict Cumberbatch       62.0
2     Robert Downey Jr.       54.0
3           Chris Evans        NaN


In [16]:
employee_df

  Employee Name  Income
0        Rajeev  200000
1         Sumit  140000
2        Aviral   90000


In [17]:
employee_df.rename(columns={'Employee Name': 'Employee Name', 'Income': 'Monthly Income'}, inplace=True)
employee_df

  Employee Name  Monthly Income
0        Rajeev          200000
1         Sumit          140000
2        Aviral           90000


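Note: the inplace parameter of rename is optional and defaults to False. With inplace=False, rename returns a new dataframe and leaves the original unchanged; with inplace=True, it modifies the dataframe itself and returns None.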
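
The effect of `inplace` can be checked directly. The following is a minimal sketch that rebuilds the employee data so it runs on its own; the variable names `renamed` and `result` are illustrative.

```python
import pandas as pd

employee_df = pd.DataFrame({'Employee Name': ['Rajeev', 'Sumit', 'Aviral'],
                            'Income': [200000, 140000, 90000]})

# Default (inplace=False): a new DataFrame is returned, the original is unchanged
renamed = employee_df.rename(columns={'Income': 'Monthly Income'})
print(list(employee_df.columns))  # ['Employee Name', 'Income']
print(list(renamed.columns))      # ['Employee Name', 'Monthly Income']

# inplace=True: the original is modified and the call returns None
result = employee_df.rename(columns={'Income': 'Monthly Income'}, inplace=True)
print(result)                     # None
print(list(employee_df.columns))  # ['Employee Name', 'Monthly Income']
```

Assigning the result of a call made with `inplace=True` to a variable is a common mistake, since that variable will hold None.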