# PANDAS

<strong>Pandas</strong> is an open-source library designed for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical data and time series, and it is built on top of the NumPy library. Pandas is fast and offers high performance and productivity for its users.

<strong>History:</strong> Pandas was initially developed by Wes McKinney in 2008 while he was working at AQR Capital Management. He convinced AQR to allow him to open-source the library. Another AQR employee, Chang She, joined as the second major contributor in 2012. Many versions of pandas have been released since then. At the time of writing, the latest version was 1.5.3, released on January 18, 2023.

<strong>Advantages</strong>

    Fast and efficient for manipulating and analyzing data.
    Data can be loaded from different file objects.
    Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
    Size mutability: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
    Data set merging and joining.
    Flexible reshaping and pivoting of data sets.
    Provides time-series functionality.
    Powerful group-by functionality for performing split-apply-combine operations on data sets.
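
A few of the features listed above can be illustrated with a minimal sketch. The data is made up for illustration, and the names `sales`, `regions`, `city`, `region`, and `sales_k` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing value (NaN)
sales = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "sales": [100.0, np.nan, 300.0, 200.0],
})

# Missing data: count the NaN values, then fill them with 0
missing_count = int(sales["sales"].isna().sum())
filled = sales.fillna({"sales": 0.0})

# Size mutability: insert a column, then delete it again
filled["sales_k"] = filled["sales"] / 1000
del filled["sales_k"]

# Split-apply-combine: total sales per city
totals = filled.groupby("city")["sales"].sum()

# Merging: join a second table on the shared "city" column
regions = pd.DataFrame({"city": ["Pune", "Delhi"], "region": ["West", "North"]})
merged = filled.merge(regions, on="city")

print(totals)
print(merged)
```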

<strong>Getting Started</strong>

The first step in working with pandas is to check whether it is installed in your Python environment. If it is not, install it with the pip command. Type cmd in the search box to open a command prompt, use the cd command to move to the folder where pip has been installed, and then type the command:

In [None]:
#!pip install pandas

After pandas has been installed, you need to import the library. This module is generally imported as:

In [2]:
import pandas as pd

Here, pd is an alias for pandas. Importing the library under an alias is not required; it simply reduces the amount of code you write each time a method or property is called.

Pandas provides two main data structures for manipulating data:

    Series
    DataFrame



## Series

<strong>Series:</strong> A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
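
The points about labels and indexing can be sketched as follows. This is a minimal illustration; the values and the labels `"a"` and `"b"` are made up.

```python
import pandas as pd

# Labels may repeat; each label must be hashable (here, strings)
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

by_label = s.loc["b"]      # label-based access returns a single value
by_position = s.iloc[0]    # integer (positional) access returns the first element
repeated = s.loc["a"]      # a repeated label returns a Series of all matches

print(by_label, by_position)
print(repeated)
print(s.index)             # the axis labels
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of indexing unambiguous.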

In [3]:
## Creating empty series (an explicit dtype avoids the default-dtype warning)
d=pd.Series(dtype='float64')


In [4]:
d

Series([], dtype: float64)

In [5]:
type(d)

pandas.core.series.Series

In [6]:
## Creating series with one element
d=pd.Series(17)

In [7]:
d

0    17
dtype: int64

In [8]:
## Creating series using tuple
t=(10,11,12)
d=pd.Series(t)

In [9]:
d

0    10
1    11
2    12
dtype: int64

In [10]:
##  creating series using list
d=pd.Series([45,78,56,445,78])

In [11]:
d

0     45
1     78
2     56
3    445
4     78
dtype: int64

In [12]:
## Creating series with array
import numpy as np
arr=np.array([1,2,3,4])
d=pd.Series(arr)


In [13]:
d

0    1
1    2
2    3
3    4
dtype: int64

In [14]:
## Creating series using dictionary
d={'a':1,'b':2,'c':3,'d':4}

In [15]:
b=pd.Series(d)

In [16]:
b

a    1
b    2
c    3
d    4
dtype: int64

In [17]:
## Creating series with custom index labels
arr_2=np.array([1,2,3])
s_2=pd.Series(arr_2 ,index=['one','two','three'])
s_2


one      1
two      2
three    3
dtype: int64

In [18]:
## checking the datatype
type(s_2)

pandas.core.series.Series

In [19]:
## access the data by position (plain b[3] on a labeled Series is deprecated in recent pandas)
b.iloc[3]

4

In [20]:
b['a']

1

In [21]:
s_2['one']

1

In [22]:
## Slicing operation
d=pd.Series([1,2,3,5,47,98,7,8,6,32,78,2,8,289,258,78])

In [23]:
d[0:9]

0     1
1     2
2     3
3     5
4    47
5    98
6     7
7     8
8     6
dtype: int64

In [24]:
d[[4,5,3,6]] ## selecting multiple elements by index label

4    47
5    98
3     5
6     7
dtype: int64

In [25]:
d[4]=34  ## changing the element

In [26]:
d[4]

34

In [27]:
arr3=np.array([1,2,4,6])
s=pd.Series(arr3,index=['one','Two','Three','Four'])   
s[0:9]


one      1
Two      2
Three    4
Four     6
dtype: int64
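
The slice in the last cell is positional, which is why an end of 9 on a four-element Series raises no error. A minimal sketch comparing positional and label-based slicing on the same data, using the explicit `.iloc` and `.loc` accessors:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 4, 6]), index=["one", "Two", "Three", "Four"])

# Positional slice: the end position is excluded, and an out-of-range
# end (9 here) is clipped to the length of the Series
positional = s.iloc[0:9]

# Label slice: both endpoints are included
labelled = s.loc["one":"Three"]

print(positional)
print(labelled)
```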

# DataFrame

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); that is, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
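
The three components can be inspected directly. The following is a minimal sketch; the table contents and the row labels `r1` to `r3` are hypothetical.

```python
import pandas as pd

# Hypothetical two-column table with mixed column types
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "age": [45, 12, 48]},
    index=["r1", "r2", "r3"],
)

print(df.index)       # row labels
print(df.columns)     # column labels
print(df.to_numpy())  # the underlying data as a NumPy array
print(df.dtypes)      # each column keeps its own dtype
print(df.shape)       # (number of rows, number of columns)
```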

## Creating a DataFrame:

In practice, a pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, and so on.
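
As a rough sketch of two of these routes, the example below reads CSV text and builds a DataFrame from a list of dictionaries. The in-memory string `csv_text` stands in for a real file so that the example is self-contained; `pd.read_csv` also accepts a file path. All data values are made up.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "name,age\na,45\nb,12\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# List of dictionaries: keys become columns, a missing key becomes NaN
records = [{"name": "a", "age": 45}, {"name": "b"}]
from_records = pd.DataFrame(records)

print(from_csv)
print(from_records)
```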

In [28]:
## empty dataframe
d=pd.DataFrame()

In [29]:
print(d)

Empty DataFrame
Columns: []
Index: []


In [30]:
## using nested list
data=[[1,2,3,4],[14,12,1,'a',8],[7,8,9]]

In [31]:
d=pd.DataFrame(data)

In [32]:
d

    0   1  2     3    4
0   1   2  3     4  NaN
1  14  12  1     a  8.0
2   7   8  9  None  NaN
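
The nested list above has rows of different lengths. A small sketch of how this can be checked: the longest row sets the number of columns, and the shorter rows are padded with missing values. The variable name `rows` is chosen for illustration.

```python
import pandas as pd

rows = [[1, 2, 3, 4], [14, 12, 1, "a", 8], [7, 8, 9]]
d = pd.DataFrame(rows)

# The longest row has 5 elements, so the frame has 5 columns
print(d.shape)

# Number of padded (missing) cells in each column
missing_per_column = d.isna().sum()
print(missing_per_column)
```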


In [33]:
data=[['abc',1],['def',2],['ghi',3],['hij'],['klm','a']]

In [34]:
d=pd.DataFrame(data,columns=['A','B']) ## setting the column names

In [35]:
d

     A     B
0  abc     1
1  def     2
2  ghi     3
3  hij  None
4  klm     a


In [36]:
d=pd.DataFrame(data,index=['A','B','C','D','E'])  ## changing the index

In [37]:
d

     0     1
A  abc     1
B  def     2
C  ghi     3
D  hij  None
E  klm     a


In [38]:
## dataframe using dictionary
data={'name':['a','b','c','d','e','f'],'age':[45,12,48,25,12,56]}

In [39]:
d=pd.DataFrame(data)

In [40]:
d

  name  age
0    a   45
1    b   12
2    c   48
3    d   25
4    e   12
5    f   56


In [41]:
## Task: change the names of the columns while defining the dataframe

In [42]:
d.rename({'name':'Name','age':'Age'},axis=1)  ## returns a renamed copy; d itself changes only with inplace=True

  Name  Age
0    a   45
1    b   12
2    c   48
3    d   25
4    e   12
5    f   56


In [43]:
d

  name  age
0    a   45
1    b   12
2    c   48
3    d   25
4    e   12
5    f   56
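
Because rename returns a new DataFrame, the original `d` still shows the old column names. One possible way to approach the task, sketched under the assumption that the same dictionary data is used; the names `d1`, `d2`, and `d3` are chosen for illustration.

```python
import pandas as pd

data = {"name": ["a", "b", "c"], "age": [45, 12, 48]}

# Option 1: rename at construction time by chaining rename()
d1 = pd.DataFrame(data).rename(columns={"name": "Name", "age": "Age"})

# Option 2: keep the renamed copy by assigning it back
d2 = pd.DataFrame(data)
d2 = d2.rename({"name": "Name", "age": "Age"}, axis=1)

# Option 3: overwrite all column labels at once
d3 = pd.DataFrame(data)
d3.columns = ["Name", "Age"]

print(d1.columns.tolist())
```

Options 1 and 2 rename by mapping, so they are safe if the column order changes; option 3 depends on the column order.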


In [44]:
## Accessing the elements from dataframe
data=[['ALex',10,'Maths'],['Bob',12,'Science'],['Kelly',15,'Eco'],['Boris',14,'Geo'],['Ken',18,'English']]

In [45]:
d=pd.DataFrame(data,columns=['Name','Age','Subject'])

In [46]:
d

    Name  Age  Subject
0   ALex   10    Maths
1    Bob   12  Science
2  Kelly   15      Eco
3  Boris   14      Geo
4    Ken   18  English


In [47]:
## accessing one single column
d['Name']

0     ALex
1      Bob
2    Kelly
3    Boris
4      Ken
Name: Name, dtype: object

In [48]:
d[['Name','Subject']]

    Name  Subject
0   ALex    Maths
1    Bob  Science
2  Kelly      Eco
3  Boris      Geo
4    Ken  English


In [49]:
d.iloc[:,[0,1]]

    Name  Age
0   ALex   10
1    Bob   12
2  Kelly   15
3  Boris   14
4    Ken   18


In [50]:
d.iloc[1:3,[0,1]]

    Name  Age
1    Bob   12
2  Kelly   15


In [51]:
d

    Name  Age  Subject
0   ALex   10    Maths
1    Bob   12  Science
2  Kelly   15      Eco
3  Boris   14      Geo
4    Ken   18  English


In [52]:
## to access a single element we use iloc, which always works with integer positions
d.iloc[0,2] ## (row position, column position)

'Maths'

In [53]:
## loc selects by labels: row labels as well as column names
d.loc[:,['Name','Subject']]



    Name  Subject
0   ALex    Maths
1    Bob  Science
2  Kelly      Eco
3  Boris      Geo
4    Ken  English


In [55]:
d.loc[:1] ## label-based slice: the end label 1 is included

   Name  Age  Subject
0  ALex   10    Maths
1   Bob   12  Science


In [56]:
d.loc[3] ## getting a single row by its label

Name       Boris
Age           14
Subject      Geo
Name: 3, dtype: object

In [57]:
d.iloc[:,[1,2]]


   Age  Subject
0   10    Maths
1   12  Science
2   15      Eco
3   14      Geo
4   18  English


In [58]:
d.iloc[:,1:3]


   Age  Subject
0   10    Maths
1   12  Science
2   15      Eco
3   14      Geo
4   18  English
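
With the default integer index, row labels and row positions coincide, so the difference between loc and iloc is easy to miss. The sketch below uses string row labels to separate the two; the labels `x`, `y`, and `z` are hypothetical, and only the first three rows of the data are reused.

```python
import pandas as pd

d = pd.DataFrame(
    [["ALex", 10, "Maths"], ["Bob", 12, "Science"], ["Kelly", 15, "Eco"]],
    columns=["Name", "Age", "Subject"],
    index=["x", "y", "z"],  # hypothetical string row labels
)

# iloc: integer positions only; the end of a slice is excluded
by_position = d.iloc[0:2, [0, 1]]

# loc: row and column labels; the end of a slice is included
by_label = d.loc["x":"y", ["Name", "Age"]]

# Both selections cover the same rows and columns here
print(by_position.equals(by_label))

# Single element by position, then the same element by labels
print(d.iloc[0, 2])
print(d.loc["x", "Subject"])
```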

