# **WELCOME**

Content taken from [this article on medium.com](https://medium.com/data-science-365/pandas-for-data-science-part-1-89bc231b3478).

Importing Libraries

In [1]:
import numpy as np
import pandas as pd

## DATA STRUCTURES IN PANDAS

A data structure is a collection of data values that defines the relationships among the data and the operations that can be performed on them. There are three main data structures in pandas (the Panel has been removed from recent versions and is covered here only for completeness):

- Series — 1D
- DataFrame — 2D
- Panel — 3D

![pandasDataStructures](https://miro.medium.com/max/720/1*1iHWBaNA9d_ArysIiuit8A.png)

The most widely used pandas data structures are the Series and the DataFrame. Put simply, a Series is like a single column of data, while a DataFrame is like a sheet with rows and columns. Likewise, a Panel can hold many DataFrames.

### **The pandas Series**

The Series is the pandas object designed to represent one-dimensional data, similar to an array but with some additional features. A Series consists of two components:

- One-dimensional data (Values)
- Index

![Series](https://miro.medium.com/max/640/1*HS9LGaTo1_GMVaY6vxwW5A.png)

A Series is composed of two arrays associated with each other. The main array (the values) holds the one-dimensional data, and each element is associated with a label held in the other array (the labels), called the **index**. To see the two arrays individually, use the `index` and `values` attributes of the Series. Because a Series is **one-dimensional**, it has a single axis (dimension), the index, and the entries of the index (0, 1, 2, 3 in the figure) are called **axis labels**.

### **Series Creation**
The general construct for creating a Series data structure is:

![SeriesCreation](https://miro.medium.com/max/720/1*h-a5PLBK15LgpR_x2tGr0g.png)

To create a Series, call the `Series()` class constructor and pass the data to be included as an argument. Here, data can be one of the following:

- A one-dimensional ndarray
- A Python list
- A Python dictionary
- A scalar value

If an index is not specified, a default index [0, …, n-1] is created, where n is the length of the data.

#### **Series Creation: Using a one-dimensional ndarray**

In [2]:
array = np.arange(1, 10, 2) # An Array of Odd Numbers
series = pd.Series(array)
series

0    1
1    3
2    5
3    7
4    9
dtype: int32

**With Custom Index**

In [3]:
array = np.arange(1, 10, 2) # An Array of Odd Numbers
series = pd.Series(array, index=["A", "B", "C", "D", "E"])
series

A    1
B    3
C    5
D    7
E    9
dtype: int32

If you want to individually see the two arrays that make up this series, you can call `index` and `values` attributes of the series.

In [4]:
series.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [5]:
series.values

array([1, 3, 5, 7, 9])

#### **Series Creation: Using a Python `list()`**

In [6]:
pylist  = list(range(1, 10, 2)) # List of Odd numbers
series = pd.Series(pylist, index=["One", "Three", "Five", "Seven", "Nine"])
series

One      1
Three    3
Five     5
Seven    7
Nine     9
dtype: int64

#### **Series Creation: Using a Python dictionary**

In [7]:
pydict = {
            "One": 1,
            "Two": 2,
            "Three": 3,
            "Four": 4,
            "Five": 5
}
series = pd.Series(pydict)
series

One      1
Two      2
Three    3
Four     4
Five     5
dtype: int64

#### **Series Creation: Using a scalar value**

In [8]:
series = pd.Series(7, index=range(10))
series

0    7
1    7
2    7
3    7
4    7
5    7
6    7
7    7
8    7
9    7
dtype: int64

#### **Selecting elements from a Series**
The indexing and slicing techniques that apply to NumPy arrays also work on a Series, because the pandas library is built on top of NumPy.

In [9]:
series = pd.Series(range(1, 10, 2), ["One", "Three", "Five", "Seven", "Nine"])
series

One      1
Three    3
Five     5
Seven    7
Nine     9
dtype: int64

- **You can select a single element using the index number or index label of the series.**

In [10]:
# Using index number (position); plain series[0] is deprecated in recent pandas when the labels are not integers
series.iloc[0]

1

In [11]:
# Using index label
series["One"]

1
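
The difference between selecting by label and selecting by position matters most when the labels are themselves integers. The following is a minimal sketch, assuming a hypothetical Series `s` (not part of the original article) whose integer labels are deliberately out of order; `.loc[]` and `.iloc[]` make the intent explicit.

```python
import pandas as pd

# Hypothetical series whose labels are integers in a shuffled order
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0 -> "b"
by_position = s.iloc[0]  # the element at position 0    -> "a"

print(by_label, by_position)
```

Using `.loc[]` for labels and `.iloc[]` for positions avoids having to remember how plain `[]` interprets an integer key.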

- **You can use the slice : notation with index numbers to select a range of elements from a series.**

In [12]:
# Selecting the first three elements
series[0:3]

One      1
Three    3
Five     5
dtype: int64

- **You can also use a list of index numbers or index labels to select multiple elements from a series.**

In [13]:
# Selecting first, second and fourth elements by position
series.iloc[[0, 1, 3]]

One      1
Three    3
Seven    7
dtype: int64

In [14]:
# Selecting first and fourth elements
series[["One", "Seven"]]

One      1
Seven    7
dtype: int64

- **You can also use conditions and Boolean operators to select elements from a series.**

In [15]:
# Selecting all elements greater than 5
series[series > 5]

Seven    7
Nine     9
dtype: int64

In [16]:
# Selecting all elements greater than 1 and smaller than 7
series[(series > 1) & (series < 7)] # or
# series[np.logical_and(series > 1, series < 7)]

Three    3
Five     5
dtype: int64

#### **Assigning Values to the Elements**
A `Series` is mutable, which means that you can change the value of an element after the Series has been created.

In [17]:
series["Seven"] = 8
series

One      1
Three    3
Five     5
Seven    8
Nine     9
dtype: int64

#### **Operations and mathematical functions on series**
Arithmetic operators (`+`, `-`, `*`, `/`) and mathematical functions that apply to NumPy arrays also work on a Series.

In [18]:
series / series

One      1.0
Three    1.0
Five     1.0
Seven    1.0
Nine     1.0
dtype: float64

In [19]:
series.mean(), series.sum(), series.count() # and many more

(5.2, 26, 5)
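
One difference from plain NumPy arrays is worth illustrating: arithmetic between two Series matches elements by index label rather than by position. The sketch below uses two hypothetical Series, `a` and `b`, with partly overlapping labels; labels present in only one of them produce `NaN` unless a fill value is supplied.

```python
import pandas as pd

# Two hypothetical series that share only the labels "y" and "z"
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Elements are matched by label: y -> 12.0, z -> 23.0; x and w -> NaN
total = a + b

# Treat a missing label as 0 instead of producing NaN
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```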

### **The pandas DataFrame**

A **DataFrame** is a two-dimensional data structure composed of rows and columns, much like a simple spreadsheet or a SQL table. *Each column of a **DataFrame** is a pandas `Series`*. The columns must all have the same length, but they can have different data types (float, int, bool, and so on).

**DataFrames** are both value-mutable and size-mutable. (A `Series`, by contrast, is only value-mutable: its values can be changed, but its length cannot.) This lets us perform operations that alter the values held within the **DataFrame** or add/delete columns to/from the **DataFrame**.

A DataFrame consists of three components.

- Two-dimensional data (Values)
- Row index
- Column index

![DataFrame](https://miro.medium.com/max/640/1*TZue1h13157x5VcAruHThg.png)

The **DataFrame** has two index arrays. The first, the row index, works much like the index of a Series: each label is associated with all the values in its row. The second, the column index, holds a series of labels, each associated with a particular column. A **DataFrame** therefore has two axes (dimensions), commonly referred to as axis 0 and axis 1, or the row/index axis and the column axis respectively.
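
To make the two axes concrete, here is a small sketch that sums a 3×3 frame along each axis. The frame and the variable names are illustrative assumptions; `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
    index=["Row 1", "Row 2", "Row 3"],
    columns=["Col 1", "Col 2", "Col 3"],
)

col_totals = df.sum(axis=0)  # one value per column: 12, 15, 18
row_totals = df.sum(axis=1)  # one value per row:     6, 15, 24

print(col_totals)
print(row_totals)
```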

#### **DataFrame Creation**

The general construct for creating a DataFrame data structure is:

![DataFrameCreation](https://miro.medium.com/max/720/1*LFLvrIyuUjm5fHMDi7cnyw.png)

A DataFrame is the most commonly used data structure in pandas. The `DataFrame()` class constructor accepts many different types of arguments:

- A two-dimensional ndarray
- A dictionary of dictionaries
- A dictionary of lists/series
- A list of lists/series
- A file or other external source, loaded with the pandas `read_*` functions rather than the constructor

Row index labels and column labels can be specified along with the data. If they are not specified, they are generated from the input data: taken from dictionary keys or Series indexes where these exist, and otherwise set to the integers 0, 1, 2, and so on.
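
As a quick illustration of those defaults, the sketch below builds two small hypothetical frames without passing `index` or `columns` and inspects the labels pandas generates.

```python
import pandas as pd

# No index or columns given: both default to 0, 1, ...
from_lists = pd.DataFrame([[1, 2], [3, 4]])

# Dictionary keys become the column labels; the row index still defaults to 0, 1, ...
from_dict = pd.DataFrame({"age": [15, 17], "marks": [720, 645]})

print(from_lists.index, from_lists.columns)
print(from_dict.index, from_dict.columns)
```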

#### **DataFrame Creation: Using a two-dimensional ndarray**

In [20]:
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # 2d array
df = pd.DataFrame(array, index=["Row 1", "Row 2", "Row 3"], columns=["Col 1", "Col 2", "Col 3"])
df

|  | Col 1 | Col 2 | Col 3 |
| :--- | :---: | :---: | :---: |
| Row 1 | 1 | 2 | 3 |
| Row 2 | 4 | 5 | 6 |
| Row 3 | 7 | 8 | 9 |


In [21]:
# Accessing values, index and columns of DataFrame
df.values, df.index, df.columns

(array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]),
 Index(['Row 1', 'Row 2', 'Row 3'], dtype='object'),
 Index(['Col 1', 'Col 2', 'Col 3'], dtype='object'))

#### **DataFrame Creation: Using a dictionary of dictionaries**

In [22]:
data = {
         "name"  : {"A" : "Ali", "B": "Ahmad", "C": "Rajab"},
         "age"   : {"A" : 15, "B" : 17, "C": 18},
         "marks" : {"A": 720, "B": 645, "C": 602}
}
df = pd.DataFrame(data)
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| A | Ali | 15 | 720 |
| B | Ahmad | 17 | 645 |
| C | Rajab | 18 | 602 |


#### **DataFrame Creation: Using a dictionary of lists/series**

In [23]:
data = {
         "name"  : pd.Series({"A" : "Ali", "B": "Ahmad", "C": "Rajab"}),
         "age"   : [15, 17, 18],
         "marks" : [720, 645, 602]
}
df = pd.DataFrame(data)
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| A | Ali | 15 | 720 |
| B | Ahmad | 17 | 645 |
| C | Rajab | 18 | 602 |


#### **DataFrame Creation: Using a list of lists/series**

In [24]:
data = [["Ali", 15, 720], ["Ahmad", 17, 645], ["Rajab", 18, 602]]
df = pd.DataFrame(data, index=["A", "B", "C"], columns=["name", "age", "marks"])
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| A | Ali | 15 | 720 |
| B | Ahmad | 17 | 645 |
| C | Rajab | 18 | 602 |


In [25]:
# The index of each Series must match the columns of the DataFrame
data = [pd.Series(["Ali", 15, 720], index=["name", "age", "marks"]),
        pd.Series(["Ahmad", 17, 645], index=["name", "age", "marks"]),
        pd.Series(["Rajab", 18, 602], index=["name", "age", "marks"])]

df = pd.DataFrame(data, index=["A", "B", "C"], columns=["name", "age", "marks"])
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| A | Ali | 15 | 720 |
| B | Ahmad | 17 | 645 |
| C | Rajab | 18 | 602 |


#### **DataFrame Creation: Using pandas `read_*`**
![](https://miro.medium.com/max/640/1*COeZklYLFwq0dhk7rr8pfg.png)

- pandas `read_csv()` function: Reads a comma-separated values (csv) file or a text file into a pandas DataFrame.
- pandas `read_excel()` function: Reads an Excel file into a pandas DataFrame.
- pandas `read_html()` function: Reads HTML tables into a list of DataFrames.
- pandas `read_sql()` function: Reads a SQL query or database table into a DataFrame.

Now, we discuss an example of reading a text (.txt) file into a pandas DataFrame. For this, I use the pandas `read_csv()` function. CSV stands for comma-separated values, which means that the comma is the default delimiter for these files. Other delimiters, such as a tab, can be specified through the `sep` parameter. The syntax of `read_csv()` is shown below.

![read_csv](https://miro.medium.com/max/720/1*xbdtYhI0nwcxCn2PQ2gI7w.png)

In [68]:
df = pd.read_csv('data.txt')
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| 0 | Ali | 15 | 720 |
| 1 | Ahmad | 17 | 450 |
| 2 | Rajab | 18 | 602 |
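
The contents of `data.txt` are not shown in the article, so the following sketch substitutes an in-memory string for the file. It assumes tab-separated text with the same three columns and reads it by passing `sep="\t"`; `io.StringIO` only stands in for a real file path here.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk
text = "name\tage\tmarks\nAli\t15\t720\nAhmad\t17\t450\nRajab\t18\t602\n"

# read_csv accepts a path or a file-like object; sep selects the delimiter
df_tab = pd.read_csv(io.StringIO(text), sep="\t")

print(df_tab)
```

With a real file, the first argument would simply be its path, for example `pd.read_csv("data.txt", sep="\t")`.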


### **Selecting elements from a DataFrame**

- **Selecting single column/feature using column name**

In [69]:
# single column
df['name'] # or
# df.name

0      Ali
1    Ahmad
2    Rajab
Name: name, dtype: object

- **Selecting multiple columns/features using a list of column names**

In [38]:
# multiple columns
df[['name', 'age']]

|  | name | age |
| :--- | :---: | :---: |
| 0 | Ali | 15 |
| 1 | Ahmad | 17 |
| 2 | Rajab | 18 |


- **Selecting multiple rows using slicing**

In [70]:
# Selecting first two rows using slicing
df[:2]

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| 0 | Ali | 15 | 720 |
| 1 | Ahmad | 17 | 450 |


- **Selecting using Conditions and Boolean operators**

In [86]:
# Selecting rows with age greater than 15
df[df["age"] > 15]

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| B | Ahmad | 17 | 450 |
| C | Rajab | 18 | 602 |


In [71]:
# Selecting rows with age less than 18 and marks greater than 700
df[np.logical_and(df["age"] < 18, df["marks"] > 700)]

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| 0 | Ali | 15 | 720 |
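
The same kind of selection can be written with the `&` (and) and `|` (or) operators instead of `np.logical_and`. The sketch below rebuilds a frame with the values shown above (an assumption, since `data.txt` itself is not included) and checks that the two spellings agree. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ali", "Ahmad", "Rajab"], "age": [15, 17, 18], "marks": [720, 450, 602]}
)

# Both forms build the same Boolean mask
with_operator = df[(df["age"] < 18) & (df["marks"] > 700)]
with_numpy = df[np.logical_and(df["age"] < 18, df["marks"] > 700)]

# "or" condition: age above 17, or marks above 700
older_or_top = df[(df["age"] > 17) | (df["marks"] > 700)]

print(with_operator)
print(older_or_top)
```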


- **Selecting using callable indexer**

In [110]:
# Selecting name and marks columns
df[lambda df : ["name", "marks"]]

|  | name | marks |
| :--- | :---: | :---: |
| A | Ali | 720 |
| B | Ahmad | 450 |
| C | Rajab | 602 |


- **You can use these indexing methods**

| Sr.No | Indexing | Description |
| :--- | :---: | :---: |
| 1 | `.loc[]` | Index label based |
| 2 | `.iloc[]` | Index number (position) based |


In [72]:
# Changing the index of df to the letters "A", "B", "C", ...
df.index = list(map(chr, range(65, 65 + len(df))))
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| A | Ali | 15 | 720 |
| B | Ahmad | 17 | 450 |
| C | Rajab | 18 | 602 |


#### `.loc[]`

`.loc[]` takes up to two indexers (a single label, a list of labels, or a slice) separated by a comma. The first selects rows and the second selects columns.

It accepts several kinds of indexers:

- A single scalar label
- A list of labels
- A slice object
- A Boolean array
- A callable indexer

In [77]:
# Scalar label
df.loc["A", "marks"]

720

In [79]:
# List of Labels
df.loc[['A', 'B'], ["name", "marks"]]

|  | name | marks |
| :--- | :---: | :---: |
| A | Ali | 720 |
| B | Ahmad | 450 |


In [82]:
# Slicing
df.loc[:'B', :'age']

|  | name | age |
| :--- | :---: | :---: |
| A | Ali | 15 |
| B | Ahmad | 17 |


In [84]:
# Conditional Operation/Boolean Array
df.loc[df.loc[:, 'marks'] > 600]

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| A | Ali | 15 | 720 |
| C | Rajab | 18 | 602 |


In [107]:
# callable indexer
df.loc[lambda df : df.loc[:, 'marks'] > 600, ['name', "age"]]

|  | name | age |
| :--- | :---: | :---: |
| A | Ali | 15 |
| C | Rajab | 18 |


#### `.iloc[]`
Similar to `.loc[]`, but based on integer positions rather than labels.

The various access methods are as follows:

- An Integer
- A list of integers
- A slicing object
- A callable indexer

**Note:** A Boolean `Series` cannot be used as a mask with `.iloc[]`. A plain Boolean array (for example, the `.values` of that Series) is accepted.

In [90]:
# An integer (selecting marks of A)
df.iloc[0, 2]

720

In [91]:
# A list of integers
df.iloc[[0, 1], [0, 2]]

|  | name | marks |
| :--- | :---: | :---: |
| A | Ali | 720 |
| B | Ahmad | 450 |


In [94]:
# Slicing
df.iloc[:2, :2]

|  | name | age |
| :--- | :---: | :---: |
| A | Ali | 15 |
| B | Ahmad | 17 |
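
Slices behave differently in the two indexers: a label slice in `.loc[]` includes its end label, while a position slice in `.iloc[]` excludes its end position, as in ordinary Python slicing. The sketch below shows this on a frame rebuilt with the values used above (the rebuilt frame is an assumption for the sake of a self-contained example).

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ali", "Ahmad", "Rajab"], "age": [15, 17, 18], "marks": [720, 450, 602]},
    index=["A", "B", "C"],
)

by_label = df.loc["A":"B"]    # rows A and B: the end label is included
by_position = df.iloc[0:1]    # row A only: the end position is excluded

print(by_label)
print(by_position)
```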


In [100]:
# A Boolean mask as a Series (indexed by label)
df['marks'] > 600

A     True
B    False
C     True
Name: marks, dtype: bool
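
A mask like this one is a Boolean Series that carries its own labels. As far as I know, `.iloc[]` rejects such a Series but accepts the same mask once it has been converted to a plain NumPy array. A minimal sketch, assuming the frame shown above and using `Series.to_numpy()` for the conversion:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ali", "Ahmad", "Rajab"], "age": [15, 17, 18], "marks": [720, 450, 602]},
    index=["A", "B", "C"],
)

mask = df["marks"] > 600                 # Boolean Series, indexed by label

by_position = df.iloc[mask.to_numpy()]   # plain Boolean array used with .iloc
by_label = df.loc[mask]                  # the Series itself used with .loc

print(by_position)
```

Both selections return rows A and C.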

In [103]:
# callable indexer
df.iloc[lambda df : [0, 2], lambda df : [0, 2]]

|  | name | marks |
| :--- | :---: | :---: |
| A | Ali | 720 |
| C | Rajab | 602 |


### **Assigning values to the elements and adding new columns**


**DataFrames** are both value-mutable and size-mutable. This means that you can change values held within the DataFrame or add/delete columns to/from the DataFrame.

In [32]:
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| 0 | Ali | 15 | 720 |
| 1 | Ahmad | 17 | 45 |
| 2 | Rajab | 18 | 602 |


Value mutability (changing values)

In [33]:
# Updating age and marks of Ahmad
df.loc[1] = ["Ahmad", 16, 702]
df

|  | name | age | marks |
| :--- | :---: | :---: | :---: |
| 0 | Ali | 15 | 720 |
| 1 | Ahmad | 16 | 702 |
| 2 | Rajab | 18 | 602 |


Size mutability (adding new columns)

In [34]:
df['grade'] = ['A', 'A', 'B']
df

|  | name | age | marks | grade |
| :--- | :---: | :---: | :---: | :---: |
| 0 | Ali | 15 | 720 | A |
| 1 | Ahmad | 16 | 702 | A |
| 2 | Rajab | 18 | 602 | B |
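
Size mutability also covers adding rows and deleting columns. The sketch below is one possible way to do both, assuming a frame with the values shown above; the extra row ("Sara") is invented purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ali", "Ahmad", "Rajab"], "age": [15, 16, 18], "marks": [720, 702, 602]}
)

# Assigning to a label that does not exist yet appends a new row
df.loc[3] = ["Sara", 16, 680]  # hypothetical student

# drop returns a new DataFrame without the listed columns
df = df.drop(columns=["age"])

print(df)
```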


### **The pandas Panel**

A **Panel** is a three-dimensional data structure. It is not as widely used as the Series or the DataFrame, and because of its 3D nature it is not as easy to display or visualize. It is generally used for 3D time-series data. The names of the three axes are as follows:

- items: This is axis 0. Each item corresponds to a DataFrame structure.
- major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
- minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.
  
As with Series and DataFrames, there are different ways to create Panel objects.

#### **Panel Creation: Using a 3D NumPy array**

![](https://miro.medium.com/max/640/1*jAoClstWiaICcXgOIb1mHQ.png)

Note that the Panel is not available in recent versions of pandas: it was deprecated in version 0.20 and removed in version 0.25.
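
In current pandas, data that would once have gone into a Panel is commonly stored in a DataFrame with a `MultiIndex` instead. The following is a rough sketch of that idea, not a drop-in replacement: the item names and values are made up, and `pd.concat` is used to stack a dictionary of DataFrames so that the dictionary keys become the outer index level (playing the role of the Panel's items axis).

```python
import numpy as np
import pandas as pd

# Two hypothetical "items", each a small DataFrame with the same columns
frames = {
    "item1": pd.DataFrame(np.arange(4).reshape(2, 2), columns=["a", "b"]),
    "item2": pd.DataFrame(np.arange(4, 8).reshape(2, 2), columns=["a", "b"]),
}

# The dictionary keys become the outer level of a two-level row index
stacked = pd.concat(frames)

one_item = stacked.loc["item2"]              # the DataFrame for one item
one_value = stacked.loc[("item2", 1), "b"]   # item2, row 1, column b -> 7

print(stacked)
```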