# Pandas


## Introduction to Pandas

### Introduction to Pandas

Pandas is an open-source library for real-world data analysis in Python. It is built on top of NumPy. Using Pandas, data can be cleaned, transformed, manipulated, and analyzed. It is suited to many kinds of data, including tabular data (as in a SQL table or an Excel spreadsheet), time series data, and observational or statistical datasets.

The steps involved to perform data analysis using Pandas are as follows:

<img src="Assests/11627398032146.PNG">


### Steps in data Analysis

#### Reading the data

The first step is to read the data. Data can be obtained in multiple formats, such as '.csv', '.json', and '.xlsx'.

Examples of each format are shown below:

<b>Example of an Excel file:</b>

<img src="Assests/x641627962603383.PNG">

<b>Example of a JSON (JavaScript Object Notation) file:</b>

<img src="Assests/x611627962355796.PNG">

<b>Example of a CSV (comma-separated values) file:</b>

<img src="Assests/x631627962488063.PNG">
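As a minimal sketch of the reading step, the snippet below loads CSV text into a DataFrame with `pd.read_csv`. The in-memory text and its column names are made up for illustration; with a real file, its path would be passed in place of the `StringIO` object. `pd.read_json` and `pd.read_excel` play the same role for the other two formats.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file path would normally be used instead
csv_text = "name,marks\nAsha,78\nRavi,92\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```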


#### Exploring the data

The next step is to explore the data. Exploring data helps to:

<ul>
    <li>know the shape (number of rows and columns) of the data</li>
    <li>understand the nature of the data by obtaining subsets of the data</li>
    <li>identify missing values and treat them accordingly</li>
    <li>get insights about the data using descriptive statistics</li> 
</ul>
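The exploration tasks listed above could look roughly like the following. This is only a sketch: the DataFrame and its column names are hypothetical, and filling a missing value with the column mean is just one possible treatment.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({'age': [25, 32, np.nan, 41],
                   'city': ['Pune', 'Delhi', 'Pune', 'Mumbai']})

print(df.shape)          # (number of rows, number of columns)
print(df.head(2))        # a subset: the first two rows
print(df.isna().sum())   # count of missing values per column
print(df.describe())     # descriptive statistics for numeric columns

# One possible treatment: fill the missing age with the column mean
df_filled = df.fillna({'age': df['age'].mean()})
```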

#### Performing operations on the data

Some of the operations supported by pandas for data manipulation are as follows:

<ul>
    <li>Grouping operations</li> 
    <li>Sorting operations</li> 
    <li>Masking operations</li> 
    <li>Merging operations</li> 
    <li>Concatenating operations</li> 
</ul>
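To make these operations concrete, here is a short sketch with one line per operation. The two small DataFrames and their column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data and a lookup table of managers
sales = pd.DataFrame({'region': ['North', 'South', 'North', 'South'],
                      'amount': [100, 250, 175, 50]})
managers = pd.DataFrame({'region': ['North', 'South'],
                         'manager': ['Meera', 'Arjun']})

grouped = sales.groupby('region')['amount'].sum()        # grouping
ordered = sales.sort_values('amount', ascending=False)   # sorting
large = sales[sales['amount'] > 90]                      # masking with a boolean condition
merged = sales.merge(managers, on='region')              # merging on a shared column
stacked = pd.concat([sales, sales], ignore_index=True)   # concatenating row-wise
```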

#### Visualizing data

The next step is to visualize the data to get a clear picture of various relationships among the data. The following plots can help visualize the data:

<ul>
    <li>Scatter plot</li>
    <li>Box plot</li>
    <li>Bar plot</li>
    <li>Histogram and many more</li>
</ul>

#### Generating Insights

All the above steps help generate insights about the data.


### Why Pandas

Pandas is one of the most popular data wrangling and analysis tools because it:

<ul>
    <li>can load large datasets easily</li>
    <li>provides streamlined forms of data representation</li>
    <li>can handle heterogeneous data, offers an extensive set of data manipulation features, and makes data flexible and customizable</li>
</ul>


## Introduction to Pandas Objects


### Getting started with Pandas

To get started with Pandas, NumPy and Pandas need to be imported as shown below:


In [1]:
# NumPy: library for numerical and scientific computing, on which Pandas is built
import numpy as np
# Pandas: library for data analysis
import pandas as pd

In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices.

The basic data structures of Pandas are Series and DataFrame.


### Pandas Series Object

A Series is a one-dimensional labelled array. It supports different data types such as integer, float, and string. Let us understand more about Series with the following example.

Consider the scenario where marks of students are given as shown in the following table:

<table>
    <tr>
        <th>Student ID</th>
        <th>Marks</th>
    </tr>
    <tr>
        <td>1</td>
        <td>78</td>
    </tr>
    <tr>
        <td>2</td>
        <td>92</td>
    </tr>
    <tr>
        <td>3</td>
        <td>36</td>
    </tr>
    <tr>
        <td>4</td>
        <td>64</td>
    </tr>
    <tr>
        <td>5</td>
        <td>89</td>
    </tr>     
</table>

The Pandas Series object can be used to represent this data in a meaningful manner. A Series is created using the following syntax:

<b>Syntax:</b>

<ul type="none">
<li><b>pd.Series(data, index, dtype)</b></li>
<li>data – It can be a list, a NumPy array, a scalar value or a dictionary.</li>
<li>index – The index can be explicitly defined for the values if required (optional parameter).</li>
<li>dtype – This represents the data type used in the series (optional parameter).</li>
</ul>


In [2]:
series = pd.Series(data = [78, 92, 36, 64, 89])  
series

0    78
1    92
2    36
3    64
4    89
dtype: int64

As shown in the above output, the Series object displays each value alongside its index label.

<b>Series.values</b> provides the values.


In [3]:
series.values

array([78, 92, 36, 64, 89], dtype=int64)

<b>Series.index</b> provides the index.


In [4]:
series.index

RangeIndex(start=0, stop=5, step=1)
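The `index` and `dtype` parameters from the syntax above can be sketched as follows. The labels `'a'`, `'b'`, `'c'` and the choice of `float64` are arbitrary and serve only as an illustration.

```python
import pandas as pd

# Hypothetical labels; dtype forces floating-point storage of the integer marks
marks = pd.Series(data=[78, 92, 36], index=['a', 'b', 'c'], dtype='float64')

print(marks.values)   # the underlying NumPy array
print(marks.index)    # the custom labels instead of a RangeIndex
print(marks.dtype)    # float64
```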

#### Accessing data in series

A value can be accessed by its index label using square brackets [ ].


In [5]:
series[1]

92

#### Slicing a series

A range of values can be selected with slice notation. With the default integer index, the stop position is excluded.


In [6]:
series[1:3]

1    92
2    36
dtype: int64

### Custom Index in Series

By default, a Series is given an integer index. A custom index can also be defined.

For example, consider the following table containing car details:

<table>
    <tr>
        <th>Car Name</th>
        <th>Car Price</th>
    </tr>
    <tr>
        <td>Swift</td>
        <td>700000</td>
    </tr>
    <tr>
        <td>Jazz</td>
        <td>800000</td>
    </tr>
    <tr>
        <td>Civic</td>
        <td>1600000</td>
    </tr>
    <tr>
        <td>Altis</td>
        <td>1800000</td>
    </tr>
    <tr>
        <td>Gallardo</td>
        <td>30000000</td>
    </tr>
</table>
 
A Pandas Series with a custom index can be created as follows:


In [7]:
data = pd.Series(data = [700000, 800000, 1600000, 1800000, 30000000], index = ['Swift', 'Jazz', 'Civic', 'Altis', 'Gallardo'])
data


Swift         700000
Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
dtype: int64

Values can be accessed by label, and a range of labels can be sliced:


In [8]:
data['Swift']

700000

In [9]:
data['Jazz': 'Gallardo']

Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
dtype: int64

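Here the output starts at Jazz and runs through Gallardo, with the end label included. By contrast, the integer slice series[1:3] shown earlier excluded the stop position. This is the fundamental difference between slicing by explicit label and slicing by implicit integer position.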
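The difference can be made explicit with the `loc` (label-based) and `iloc` (position-based) accessors. The snippet below is a small sketch on a shortened version of the car price data.

```python
import pandas as pd

prices = pd.Series([700000, 800000, 1600000],
                   index=['Swift', 'Jazz', 'Civic'])

# Explicit (label-based) slice: the end label is included
by_label = prices.loc['Swift':'Jazz']

# Implicit (position-based) slice: the stop position is excluded
by_position = prices.iloc[0:1]

print(len(by_label))      # 2
print(len(by_position))   # 1
```

Using `loc` and `iloc` states the intent directly, which helps avoid ambiguity when the index itself consists of integers.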


### Series as a specialized dictionary

A Series can also be viewed as a specialized dictionary in which the keys act as the index and the corresponding dictionary values become the Series values.

Let us create a series out of the dictionary data structure.


In [10]:
# Using a dictionary to create a Series
car_price_dict = {'Swift': 700000,
                  'Jazz': 800000,
                  'Civic': 1600000,
                  'Altis': 1800000,
                  'Gallardo': 30000000}
car_price = pd.Series(car_price_dict)
car_price

Swift         700000
Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
dtype: int64
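The dictionary analogy extends to a few everyday operations. The following sketch, on a shortened version of the data, shows membership tests, key listing, and adding an entry by assignment.

```python
import pandas as pd

car_price = pd.Series({'Swift': 700000, 'Jazz': 800000})

print('Swift' in car_price)     # membership is tested against the index, like dict keys
print(list(car_price.keys()))   # the index labels

# Assigning to a new label adds an entry, as with a dictionary
car_price['Civic'] = 1600000
print(len(car_price))
```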

### Pandas DataFrame object

A Series gives a useful way to view and manipulate one-dimensional data. When data is organized in rows and columns, however, the Pandas DataFrame object is needed. A DataFrame is a collection of Series in which each Series represents a column of a table.

For example, consider the following table containing car details:

<table>
    <tr>
        <th>Car Name</th>
        <th>Car Price</th>
        <th>Car Manufacturer</th>
    </tr>
    <tr>
        <td>Swift</td>
        <td>700000</td>
        <td>Maruti</td>
    </tr>
    <tr>
        <td>Jazz</td>
        <td>800000</td>
        <td>Honda</td>
    </tr>
    <tr>
        <td>Civic</td>
        <td>1600000</td>
        <td>Honda</td>
    </tr>
    <tr>
        <td>Altis</td>
        <td>1800000</td>
        <td>Toyota</td>
    </tr>
    <tr>
        <td>Gallardo</td>
        <td>30000000</td>
        <td>Lamborghini</td>
    </tr>
</table>


Let us create two series from two dictionaries - one containing car name and price and the other with car name and manufacturer.


In [11]:
# Creating the car price Series from a dictionary
car_price_dict = {'Swift': 700000,
                  'Jazz': 800000,
                  'Civic': 1600000,
                  'Altis': 1800000,
                  'Gallardo': 30000000}
car_price = pd.Series(car_price_dict)
# Creating the car manufacturer Series from a dictionary
car_man_dict = {'Swift': 'Maruti',
                'Jazz': 'Honda',
                'Civic': 'Honda',
                'Altis': 'Toyota',
                'Gallardo': 'Lamborghini'}
car_man = pd.Series(car_man_dict)
print(car_price)
print(car_man)

Swift         700000
Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
dtype: int64
Swift            Maruti
Jazz              Honda
Civic             Honda
Altis            Toyota
Gallardo    Lamborghini
dtype: object


Let us create a DataFrame object using the Series objects as shown below:

<b>Syntax:</b>

<ul>
    <li><b>pd.DataFrame(data, index, columns)</b></li>
    <li>data – Series, list-like objects, or a dictionary of these. If data is a dictionary, column order follows insertion order.</li>
    <li>index – Row labels for the resulting DataFrame. Defaults to RangeIndex(0, 1, 2, …, n) if no explicit index is provided and the data carries none.</li>
    <li>columns – Column labels for the resulting DataFrame. If the data already contains column labels, those are used; otherwise defaults to RangeIndex(0, 1, 2, …, n).</li>
</ul>


In [12]:
cars = pd.DataFrame({'Price': car_price , 'Manufacturer' : car_man})
cars

             Price  Manufacturer
Swift       700000        Maruti
Jazz        800000         Honda
Civic      1600000         Honda
Altis      1800000        Toyota
Gallardo  30000000   Lamborghini


The output shows a DataFrame containing multiple columns. The car names act as the index labels, and ‘Price’ and ‘Manufacturer’ act as the columns or 'features' of this small dataset.

To access individual features, the following code can be used:


In [13]:
cars['Price']

Swift         700000
Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
Name: Price, dtype: int64

In [14]:
cars['Manufacturer']

Swift            Maruti
Jazz              Honda
Civic             Honda
Altis            Toyota
Gallardo    Lamborghini
Name: Manufacturer, dtype: object
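Besides single columns, rows and individual cells can be selected by label. The snippet below is a sketch on a two-row version of the cars data, built directly from lists to keep it self-contained.

```python
import pandas as pd

cars = pd.DataFrame({'Price': [700000, 800000],
                     'Manufacturer': ['Maruti', 'Honda']},
                    index=['Swift', 'Jazz'])

row = cars.loc['Swift']                    # one row, returned as a Series
subset = cars[['Price', 'Manufacturer']]   # a list of column names returns a DataFrame
cell = cars.loc['Jazz', 'Price']           # a single value by row and column label

print(cars.index)     # row labels
print(cars.columns)   # column labels
```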

### Ways to create a DataFrame

There are different approaches to create a DataFrame such as:

#### 1. From a single series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:


In [15]:
# Using a dictionary to create a Series
car_price_dict = {'Swift': 700000,
                  'Jazz': 800000,
                  'Civic': 1600000,
                  'Altis': 1800000,
                  'Gallardo': 30000000}
car_price = pd.Series(car_price_dict)
# Creating a single-column DataFrame from the car_price Series
pd.DataFrame(car_price, columns=['Car Price'])

          Car Price
Swift        700000
Jazz         800000
Civic       1600000
Altis       1800000
Gallardo   30000000


#### 2. From a list of dictionaries

Consider the following data of marks for four students.

<table>
    <tr>
        <th>Name</th>
        <th>Marks</th>
    </tr>
    <tr>
        <td>Subodh</td>
        <td>28</td>
    </tr>
    <tr>
        <td>Ram</td>
        <td>27</td>
    </tr>
    <tr>
        <td>Abdul</td>
        <td>26</td>
    </tr>
    <tr>
        <td>John</td>
        <td>28</td>
    </tr>
</table>

The following list of dictionaries can be used:


In [16]:
data = [{'Name': 'Subodh', 'Marks': 28},
        {'Name': 'Ram', 'Marks': 27}, 
        {'Name': 'Abdul', 'Marks': 26}, 
        {'Name': 'John', 'Marks': 28}]
pd.DataFrame(data)

     Name  Marks
0  Subodh     28
1     Ram     27
2   Abdul     26
3    John     28


Suppose the following table needs to be represented as a DataFrame:

<table>
    <tr>
        <td>Subject</td>
        <td>Subodh</td>
        <td>Ram</td>
        <td>Abdul</td>
        <td>John</td>
    </tr>
    <tr>
        <td>Mathematics</td>
        <td>20</td>
        <td>25</td>
        <td>Not appeared</td>
        <td>Not appeared</td>
    </tr>
    <tr>
        <td>Physics</td>
        <td>Not appeared</td>
        <td>Not appeared</td>
        <td>29</td>
        <td>24</td>
    </tr>
</table>


In [17]:
pd.DataFrame([{'Subodh':20, 'Ram':25},
              {'Abdul':29, 'John':24}], 
              index = ['Mathematics', 'Physics'])

             Subodh   Ram  Abdul  John
Mathematics    20.0  25.0    NaN   NaN
Physics         NaN   NaN   29.0  24.0


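Each dictionary in the list becomes a row, and the index labels the subjects. Keys that are missing from a dictionary produce missing values in that row.

Note: NaN (Not a Number) represents missing values. Because NaN is a floating-point value, the columns that contain it are stored as floats, which is why the marks appear as 20.0 rather than 20.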
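A short sketch of how these missing values can be detected and, if appropriate, replaced. Filling with 0 is only an assumption made for the illustration; whether 0 is a sensible stand-in for "Not appeared" depends on the analysis.

```python
import pandas as pd

marks = pd.DataFrame([{'Subodh': 20, 'Ram': 25},
                      {'Abdul': 29, 'John': 24}],
                     index=['Mathematics', 'Physics'])

print(marks.isna())   # True wherever a value is missing
print(marks.dtypes)   # columns containing NaN are stored as float64

# One possible treatment, assuming 0 is an acceptable stand-in for a missing mark
filled = marks.fillna(0).astype(int)
print(filled)
```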

#### 3. From a dictionary of series objects

A DataFrame can be constructed from a dictionary of Series objects:


In [18]:
# Creating the car price Series from a dictionary
car_price_dict = {'Swift': 700000,
                  'Jazz': 800000,
                  'Civic': 1600000,
                  'Altis': 1800000,
                  'Gallardo': 30000000}
car_price = pd.Series(car_price_dict)
# Creating the car manufacturer Series from a dictionary
car_man_dict = {'Swift': 'Maruti',
                'Jazz': 'Honda',
                'Civic': 'Honda',
                'Altis': 'Toyota',
                'Gallardo': 'Lamborghini'}
car_man = pd.Series(car_man_dict)
cars = pd.DataFrame({'Price': car_price , 'Manufacturer' : car_man})
cars

             Price  Manufacturer
Swift       700000        Maruti
Jazz        800000         Honda
Civic      1600000         Honda
Altis      1800000        Toyota
Gallardo  30000000   Lamborghini


#### 4. From an existing file

In most real-world scenarios, the data is stored in files of different formats such as csv, xlsx, and json. Pandas supports reading data from these files. Below is an example of creating a DataFrame from a json file.


In [22]:
data_json = pd.read_json('Assests/data.json')
data_json

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
...       ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
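For readers without the data file, the same call can be tried on in-memory text. The JSON string below is hypothetical and uses the column-oriented layout (column name, then row label, then value) that `pd.read_json` expects by default.

```python
import io
import pandas as pd

# Hypothetical JSON text: {column -> {row label -> value}}
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# read_json accepts a file path or a file-like object
df = pd.read_json(io.StringIO(json_text))
print(df.shape)
```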


#### The axis keyword

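One of the important parameters used while performing operations on DataFrames is 'axis'. It takes two values: 0 and 1.

axis = 0 (or 'index') refers to the row axis: an operation such as a sum runs down the rows and produces one result per column, while dropping with axis = 0 removes rows.

axis = 1 (or 'columns') refers to the column axis: a sum runs across the columns and produces one result per row, while dropping with axis = 1 removes columns.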
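The effect of the two values can be seen in a small sketch. The scores table below is made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({'maths': [20, 25], 'physics': [29, 24]},
                      index=['Subodh', 'Ram'])

col_totals = scores.sum(axis=0)   # collapses the rows: one total per column
row_totals = scores.sum(axis=1)   # collapses the columns: one total per row

without_physics = scores.drop('physics', axis=1)   # drops a column label
without_ram = scores.drop('Ram', axis=0)           # drops a row label

print(col_totals)
print(row_totals)
```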
