Pandas is an open-source Python library that offers a diverse set of tools for performing data analysis more efficiently. The package is mainly used for data pre-processing tasks such as data cleaning, manipulation, and transformation, which makes it a very handy tool for data scientists and analysts.

# Installing and Importing Pandas

In [2]:
import pandas as pd
import numpy as np

# Data Structures in Pandas

A data structure is a specialized way of organizing, processing, and storing data so that specific kinds of operations can be applied to it.

Pandas has two main types of data structures, distinguished by their dimensionality:
1. Series - 1D labeled array
2. DataFrame - 2D labeled tabular structure

# Series Vs Dataframes

Let’s summarize the difference between the two structures in a table:

| | Pandas Series | Pandas DataFrame |
|---|---|---|
| Size | Size-immutable: once created, the size of a Series object cannot be changed. | Size-mutable: elements can be dropped or added in an existing DataFrame. |
| Data types | Homogeneous: Series elements must be of the same data type. | Heterogeneous: DataFrame elements can have different data types. |
| Dimensions | One-dimensional | Two-dimensional |
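To make the data-type and size rows of this comparison concrete, here is a minimal sketch. The variable names (`mixed`, `frame`) and the sample values are made up for illustration and do not come from the notebook.

```python
import pandas as pd

# A Series has a single dtype: mixing ints and floats upcasts every element to float64.
mixed = pd.Series([1, 2, 3.5])
print(mixed.dtype)

# A DataFrame can hold a different dtype in each column.
frame = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "c"]})
print(frame.dtypes)

# Columns can be added to and dropped from an existing DataFrame.
frame["ratio"] = [0.1, 0.2, 0.3]
frame = frame.drop(columns=["label"])
print(list(frame.columns))
```

After the last two steps, `frame` keeps the `count` column and gains `ratio`, while `label` is gone.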

# Creating Pandas Series Using Dictionary

In [2]:
data = {"Mon":22,"Tue":23,"Wed":24,"Fri":25,"Sat":26,"Sun":27}
series = pd.Series(data = data, name = "series_from_dict")
series

Mon    22
Tue    23
Wed    24
Fri    25
Sat    26
Sun    27
Name: series_from_dict, dtype: int64

If data is of dict type and index is not specified, the dict keys will be the index labels.
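For contrast, the sketch below shows what happens when an index is specified together with a dict: values are looked up by key, in the order given by the index, and labels with no matching key get NaN. The names `temps` and `picked` are illustrative only.

```python
import pandas as pd

temps = {"Mon": 22, "Tue": 23, "Wed": 24}

# With an explicit index, values are matched by key.
# "Thu" is not a key in the dict, so its value is NaN.
picked = pd.Series(data=temps, index=["Wed", "Mon", "Thu"])
print(picked)
```

Because NaN is a floating-point value, the resulting Series is stored as floats rather than integers.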

# Creating a Pandas Series Using an ndarray

If data is an ndarray, the index must be of the same length as the array. If index is not specified, it will be created automatically with the values [0, ..., len(data) - 1].

NumPy has a function, random.randn(), that produces an ndarray populated with random floats drawn from the standard normal distribution. Let's use that here:

In [5]:
data = np.random.randn(5)
series = pd.Series(data = data,
                  index = ['1','two','three','four','five'],
                  name = 'series_from_ndarray')
series

1       -0.483790
two     -1.691492
three   -0.403050
four     0.497975
five     0.042025
Name: series_from_ndarray, dtype: float64
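The two index rules stated above can be checked with a short sketch. The array contents and the names `values`, `auto`, and `length_error` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# No index given: pandas assigns the labels 0, 1, ..., len(values) - 1.
auto = pd.Series(values)
print(list(auto.index))

# An index whose length differs from the array is rejected with a ValueError.
try:
    pd.Series(values, index=["a", "b"])
    length_error = None
except ValueError as exc:
    length_error = exc
    print("ValueError:", exc)
```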

# Creating a Pandas Series Using Scalar Values 

The data can also be a single scalar value. An index has to be provided in this case, and the value is repeated to match the length of the index.

In [10]:
series = pd.Series(data = 7.3,
                  index = ['s','t','u','v'],
                  name = 'series_from_scalar')
print(series)

s    7.3
t    7.3
u    7.3
v    7.3
Name: series_from_scalar, dtype: float64


# Pandas Dataframe 

A Pandas DataFrame, on the other hand, is a two-dimensional structure with rows and columns whose size can be changed. You can also think of it as a dictionary of Series objects.

In [16]:
df = pd.DataFrame(data = None, index = None, columns = None, dtype = None)
df  # The DataFrame is empty, so displaying it shows no rows or columns.

The data parameter can take any of the following data types:
1. Dictionary
2. 2D ndarray
3. Pandas Series
4. Another Pandas Dataframe

The index parameter is optional and accepts row labels.

The columns parameter is optional and accepts column labels.

The dtype parameter is optional and forces a single data type on the DataFrame. If it is omitted, pandas infers the types from the data.
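As a rough illustration of the four parameters working together, the sketch below builds a DataFrame from a 2D ndarray. The array values, the labels, and the names `grid` and `table` are invented for this example.

```python
import numpy as np
import pandas as pd

grid = np.array([[1, 2], [3, 4], [5, 6]])

table = pd.DataFrame(
    data=grid,                    # 2D ndarray: 3 rows, 2 columns
    index=["r1", "r2", "r3"],     # one row label per row
    columns=["a", "b"],           # one column label per column
    dtype="float64",              # force the same dtype on every column
)
print(table)
print(table.dtypes)
```

The integers in `grid` are stored as floats because of the `dtype` argument.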

# Creating a Pandas Dataframe Using a Dictionary of Pandas Series

The index of the resulting DataFrame is the union of the indexes of the individual Series. If a Series is created without an explicit index, its index is created automatically with the values [0, ..., len(data) - 1].

In [3]:
data = pd.DataFrame({
    "Class 1": pd.Series([22,33,38], index = ["math avg", "science avg", "english avg"]),
    "Class 2": pd.Series([45, 28, 36], index=["math avg", "science avg",  "english avg"]),
    "Class 3": pd.Series([32, 41, 47], index=["math avg", "science avg",  "english avg"])
})

data

             Class 1  Class 2  Class 3
math avg          22       45       32
science avg       33       28       41
english avg       38       36       47


In the cell above, we created a DataFrame in which each column is a Series. Selecting a single column confirms this:

In [7]:
type(data['Class 1'])

pandas.core.series.Series
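When the Series in the dictionary do not share the same index labels, pandas aligns them by label and fills the gaps with NaN. The sketch below is a small variation on the example above; the name `scores` and the reduced data are assumptions made for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "Class 1": pd.Series([22, 33], index=["math avg", "science avg"]),
    "Class 2": pd.Series([28, 36], index=["science avg", "english avg"]),
})

# Every label that appears in either Series becomes a row;
# a class with no value for a given label gets NaN there.
print(scores)
```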

# Creating a Pandas Dataframe using a Dictionary of Lists or ndarrays

The lists or ndarrays must all be of the same length. If an index is passed, it must be of the same length as the arrays. If the index is not specified, it will be range(n), where n is the array length.

In [9]:
data = {
    "Class 1": [22,33,38],
    "Class 2": [45,28,36],
    "Class 3":[32,41,47]
}

df = pd.DataFrame(data = data,  index = ['math avg', 'science avg','english avg'])

In [10]:
df

             Class 1  Class 2  Class 3
math avg          22       45       32
science avg       33       28       41
english avg       38       36       47
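The length and default-index rules for a dictionary of lists can be verified with a short sketch. The names `marks`, `default_df`, and `mismatch_error` are illustrative and not part of the notebook.

```python
import pandas as pd

marks = {"Class 1": [22, 33, 38], "Class 2": [45, 28, 36]}

# Without an index, the rows are labelled 0, 1, 2.
default_df = pd.DataFrame(marks)
print(list(default_df.index))

# Lists of different lengths cannot be combined into columns.
try:
    pd.DataFrame({"Class 1": [22, 33, 38], "Class 2": [45, 28]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```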


# Creating a Pandas Dataframe Using a List of Dictionaries

In [11]:
data = [{'col1': 1, 'col2': 2},{'col1':5, 'col2':10, 'col3': 20}]
pd.DataFrame(data = data)

   col1  col2  col3
0     1     2   NaN
1     5    10  20.0
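In the output above, the first dictionary has no 'col3' key, so that cell is left as a missing value (NaN), and the column is stored as floats, which is why 20 appears as 20.0. The sketch below checks this; the names `rows` and `from_rows` are assumptions.

```python
import pandas as pd

rows = [{"col1": 1, "col2": 2}, {"col1": 5, "col2": 10, "col3": 20}]
from_rows = pd.DataFrame(rows)

# Each dictionary becomes one row; the columns are all keys seen across the list.
print(list(from_rows.columns))

# The missing 'col3' entry in the first row is NaN.
print(from_rows["col3"].isna().tolist())
print(from_rows["col3"].dtype)
```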


# Creating a Pandas Dataframe Using a Series

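When you create a DataFrame from a single Series, the resulting DataFrame has one column. If the Series is passed directly, the column takes the name of the Series. In the cell below, the Series is wrapped in a dictionary, so the column takes its name from the dictionary key.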

In [12]:
data = pd.DataFrame({"col 1": pd.Series([22,33,38])})
data

   col 1
0     22
1     33
2     38
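As a complement to the dictionary form, here is a minimal sketch of passing a named Series directly, in which case the column name comes from the Series name. The names `named`, `from_constructor`, and `from_method` are illustrative.

```python
import pandas as pd

named = pd.Series([22, 33, 38], name="col 1")

# Passing the Series directly: the single column is named after the Series.
from_constructor = pd.DataFrame(named)

# Series.to_frame() is an equivalent way to get the same one-column DataFrame.
from_method = named.to_frame()

print(list(from_constructor.columns))
```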


After the creation of a DataFrame, you can query it and select, add, or delete columns from it, i.e., perform Data Manipulation. 
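A rough sketch of the three column operations is given below. It reuses the class-average data from earlier in a smaller form; the name `df_demo` and the derived 'Total' column are assumptions made for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Class 1": [22, 33, 38], "Class 2": [45, 28, 36]},
    index=["math avg", "science avg", "english avg"],
)

# Select: indexing with a column name returns that column as a Series.
class_1 = df_demo["Class 1"]

# Add: assigning to a new column name creates the column.
df_demo["Total"] = df_demo["Class 1"] + df_demo["Class 2"]

# Delete: drop() returns a new DataFrame without the listed columns.
df_demo = df_demo.drop(columns=["Class 2"])

print(df_demo)
```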



A Pandas DataFrame can be queried in multiple ways, for example with the .loc[] and .iloc[] indexers. Use .iloc[] to select by integer position and .loc[] to select by row and column labels.
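The difference between the two can be sketched as follows. The DataFrame `df_demo` and the other variable names are illustrative. Note that a position slice with .iloc[] excludes its end point, while a label slice with .loc[] includes it.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Class 1": [22, 33, 38], "Class 2": [45, 28, 36]},
    index=["math avg", "science avg", "english avg"],
)

# Same cell, addressed by label and by integer position.
by_label = df_demo.loc["science avg", "Class 2"]
by_position = df_demo.iloc[1, 1]

# Position slices stop before the end; label slices include the end label.
first_two_rows = df_demo.iloc[0:2]
label_slice = df_demo.loc["math avg":"science avg"]

print(by_label, by_position)
print(first_two_rows.equals(label_slice))
```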

