# Pandas

## Introduction to Pandas

Pandas is an open-source Python library for real-world data analysis, built on top of NumPy. With Pandas, data can be cleaned, transformed, manipulated, and analyzed. It suits many kinds of data, including tabular data (as in an SQL table or an Excel spreadsheet), time series data, and observational or statistical datasets.

The steps involved in data analysis with Pandas are as follows:

<img src="Assests/11627398032146.PNG">

### Steps in Data Analysis

#### Reading the data

The first step is to read the data. Data can be obtained in multiple formats, such as '.csv', '.json', and '.xlsx'.

Examples of each format are shown below:

<b>Example of an Excel file:</b>

<img src="Assests/x641627962603383.PNG">

<b>Example of a JSON (JavaScript Object Notation) file:</b>

<img src="Assests/x611627962355796.PNG">

<b>Example of a CSV (comma-separated values) file:</b>

<img src="Assests/x631627962488063.PNG">
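As a minimal sketch of this reading step, the snippet below loads CSV and JSON content with `pd.read_csv` and `pd.read_json`. The data is hypothetical and held in memory through `io.StringIO` so that the example runs without any files on disk; in practice a file path would be passed instead.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "student_id,marks\n1,78\n2,92\n3,36\n"
json_text = '[{"student_id": 1, "marks": 78}, {"student_id": 2, "marks": 92}]'

# read_csv and read_json accept a file path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)

# An Excel file can be read in a similar way with pd.read_excel("marks.xlsx").
# The file name is hypothetical, and an Excel engine such as openpyxl
# must be installed for this call to work.
```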

#### Exploring the data

The next step is to explore the data. Exploring data helps to:

<ul>
    <li>know the shape (number of rows and columns) of the data</li>
    <li>understand the nature of the data by obtaining subsets of the data</li>
    <li>identify missing values and treat them accordingly</li>
    <li>get insights about the data using descriptive statistics</li> 
</ul>
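The exploration tasks listed above can be sketched as follows. The DataFrame and its column names are hypothetical, and filling a missing value with the column mean is only one of several possible treatments.

```python
import numpy as np
import pandas as pd

# Hypothetical marks data with one missing value.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "marks": [78.0, 92.0, np.nan, 64.0, 89.0],
})

print(df.shape)         # (number of rows, number of columns)
print(df.head(3))       # first three rows as a subset of the data
print(df.isna().sum())  # count of missing values per column
print(df.describe())    # descriptive statistics for numeric columns

# One possible treatment: fill the missing mark with the column mean.
filled = df.fillna({"marks": df["marks"].mean()})
print(filled)
```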

#### Performing operations on the data

Some of the operations Pandas supports for data manipulation are as follows:

<ul>
    <li>Grouping operations</li> 
    <li>Sorting operations</li> 
    <li>Masking operations</li> 
    <li>Merging operations</li> 
    <li>Concatenating operations</li> 
</ul>
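Each of the operations listed above maps to a short Pandas call. The sketch below uses two small hypothetical DataFrames; the column names and the pass mark of 50 are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical data for illustration.
marks = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "section": ["A", "B", "A", "B"],
    "marks": [78, 92, 36, 64],
})
names = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Asha", "Ben", "Chen", "Dia"],
})

# Grouping: average marks per section.
grouped = marks.groupby("section")["marks"].mean()

# Sorting: highest marks first.
sorted_marks = marks.sort_values("marks", ascending=False)

# Masking: keep only rows that satisfy a boolean condition.
passed = marks[marks["marks"] >= 50]

# Merging: combine two tables on a shared key column.
merged = marks.merge(names, on="student_id")

# Concatenating: stack tables on top of each other.
stacked = pd.concat([marks, marks], ignore_index=True)

print(grouped, sorted_marks, passed, merged, stacked, sep="\n\n")
```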

#### Visualizing data

The next step is to visualize the data to get a clear picture of the relationships among its variables. The following plots can help visualize the data:

<ul>
    <li>Scatter plot</li>
    <li>Box plot</li>
    <li>Bar plot</li>
    <li>Histogram and many more</li>
</ul>
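The plot types listed above are available through the `plot` accessor of a DataFrame or Series. The sketch below assumes that matplotlib is installed, since Pandas uses it for plotting by default. The data and column names are hypothetical. The `Agg` backend is selected only so the code runs without a display; inside a notebook that line is usually unnecessary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 3, 4],
    "marks": [78, 92, 36, 64, 89],
})

# One figure with four panels, one per plot type.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
df.plot.scatter(x="hours_studied", y="marks", ax=axes[0], title="Scatter plot")
df["marks"].plot.box(ax=axes[1], title="Box plot")
df.plot.bar(x="hours_studied", y="marks", ax=axes[2], title="Bar plot")
df["marks"].plot.hist(bins=5, ax=axes[3], title="Histogram")
fig.tight_layout()
```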

#### Generating insights

Together, the steps above help generate insights about the data.

### Why Pandas

Pandas is one of the most popular data wrangling and analysis tools because it:

<ul>
    <li>can load large datasets easily</li>
    <li>provides streamlined forms of data representation</li>
    <li>can handle heterogeneous data, has an extensive set of data manipulation features, and makes data flexible and customizable</li>
</ul>

## Introduction to Pandas Objects

### Getting started with Pandas

To get started with Pandas, NumPy and Pandas need to be imported as shown below:

In [1]:
# Import NumPy, the Python library for numerical and scientific computing.
# Pandas is built on top of NumPy.
import numpy as np
# Import Pandas
import pandas as pd

In a nutshell, Pandas objects are advanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices.

The basic data structures of Pandas are Series and DataFrame. 

### Pandas Series Object

A Series is a one-dimensional labelled array. It supports different data types such as integer, float, and string. The following example shows how a Series works.

Consider the scenario where marks of students are given as shown in the following table:

<table>
    <tr>
        <th>Student ID</th>
        <th>Marks</th>
    </tr>
    <tr>
        <td>1</td>
        <td>78</td>
    </tr>
    <tr>
        <td>2</td>
        <td>92</td>
    </tr>
    <tr>
        <td>3</td>
        <td>36</td>
    </tr>
    <tr>
        <td>4</td>
        <td>64</td>
    </tr>
    <tr>
        <td>5</td>
        <td>89</td>
    </tr>     
</table>

The Pandas Series object can be used to represent this data in a meaningful manner. A Series is created using the following syntax:

<b>Syntax:</b>
<ul type="none">
<li><b>pd.Series(data, index, dtype)</b></li>
<li>data – The values to store. It can be a list, a NumPy array, or even a dictionary.</li>
<li>index – Labels for the values (optional parameter). If omitted, a default integer index starting at 0 is used.</li>
<li>dtype – The data type used in the series (optional parameter).</li>
</ul>
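To illustrate the three parameters, here is a brief sketch that builds one Series from a list with an explicit `index` and `dtype`, and another from a dictionary. The variable names and the chosen index values are assumptions for illustration.

```python
import pandas as pd

# From a list, with an explicit index and an explicit dtype.
marks = pd.Series(data=[78, 92, 36], index=[1, 2, 3], dtype="float64")

# From a dictionary: the keys become the index labels.
prices = pd.Series({"Swift": 700000, "Jazz": 800000})

print(marks)
print(prices)
```

When `dtype` is left out, Pandas infers it from the data, which is why the integer list in the next cell produces an integer Series.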


In [2]:
series = pd.Series(data = [78, 92, 36, 64, 89])  
series

0    78
1    92
2    36
3    64
4    89
dtype: int64

As the output above shows, a Series object holds the values along with their index labels.

<b>Series.values</b> provides the values.

In [3]:
series.values

array([78, 92, 36, 64, 89], dtype=int64)

<b>Series.index</b> provides the index.

In [4]:
series.index

RangeIndex(start=0, stop=5, step=1)

#### Accessing data in series

A value can be accessed by its associated index label using square brackets, [ ].

In [5]:
series[1]

92

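#### Slicing a series

A range of values can be obtained by slicing. With an integer slice, the start position is included and the stop position is excluded.
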
In [6]:
series[1:3]

1    92
2    36
dtype: int64

### Custom Index in Series

By default, a Series is created with an integer index starting at 0. A custom index can also be defined.

For example, consider the following table containing car details:

<table>
    <tr>
        <th>Car Name</th>
        <th>Car Price</th>
    </tr>
    <tr>
        <td>Swift</td>
        <td>700000</td>
    </tr>
    <tr>
        <td>Jazz</td>
        <td>800000</td>
    </tr>
    <tr>
        <td>Civic</td>
        <td>1600000</td>
    </tr>
    <tr>
        <td>Altis</td>
        <td>1800000</td>
    </tr>
    <tr>
        <td>Gallardo</td>
        <td>30000000</td>
    </tr>
</table>
 
A Pandas Series with a custom index can be created as follows:

In [7]:
data = pd.Series(
    data=[700000, 800000, 1600000, 1800000, 30000000],
    index=['Swift', 'Jazz', 'Civic', 'Altis', 'Gallardo'],
)
data


Swift         700000
Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
dtype: int64

Values can be accessed by their labels:

In [8]:
data['Swift']

700000

In [9]:
data['Jazz': 'Gallardo']

Jazz          800000
Civic        1600000
Altis        1800000
Gallardo    30000000
dtype: int64

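Here the output starts at Jazz and runs through Gallardo, with the end label included. In contrast, the integer slice series[1:3] shown earlier excludes the stop position. This is the fundamental difference between slicing with an explicit (label-based) index and slicing with an implicit (position-based) index.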
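The difference can be made explicit with the `.loc` (label-based) and `.iloc` (position-based) accessors. These accessors are not used elsewhere in this section, so the snippet below is a minimal sketch that rebuilds the car price Series to remain self-contained.

```python
import pandas as pd

data = pd.Series(
    [700000, 800000, 1600000, 1800000, 30000000],
    index=["Swift", "Jazz", "Civic", "Altis", "Gallardo"],
)

# Label-based slice: the end label "Altis" is included.
by_label = data.loc["Jazz":"Altis"]

# Position-based slice: position 3 ("Altis") is excluded.
by_position = data.iloc[1:3]

print(by_label)
print(by_position)
```

Using `.loc` and `.iloc` states the intent directly, which helps avoid confusion when the index labels are themselves integers.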