# Pandas

Pandas is a powerful data analysis and manipulation library. It has two main data structures:
1. Series: a 1D array-like object.
2. DataFrame: an ordered collection of columns, arranged as rows and columns.

In [1]:
import pandas as pd

In [2]:
print(pd.__version__)

1.5.3


## Series

A Series is a 1D labelled array that can hold values of any type.

In [5]:
values = [1, 2, 3, 4, 5]  # avoid the name `list`, which shadows the built-in type
print(values)
type(values)

[1, 2, 3, 4, 5]


list

In [7]:
ser = pd.Series(values)
ser

0    1
1    2
2    3
3    4
4    5
dtype: int64

Above, the list is converted into a 1D Series.
One difference between NumPy arrays and Series is that a Series has an index whose labels can be changed to suit the user, whereas a NumPy array is addressed only by integer position.
NumPy arrays are generally faster for small data. A commonly cited rule of thumb is that Pandas performs better once the number of rows exceeds about 500K; below that, NumPy arrays can be used.
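
The index difference described above can be sketched as follows. This is a minimal illustration; the names `arr` and `prices` are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([10, 20, 30])
first_by_position = arr[0]

# A Series carries an index whose labels can be replaced.
prices = pd.Series([10, 20, 30])
prices.index = ["x", "y", "z"]  # relabel the default 0..2 index
value_by_label = prices["y"]

print(first_by_position, value_by_label)
```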

In [8]:
values = [1, 2, 3, 4, 5]
label = ['a', 'b', 'c', 'd', 'e']
values

[1, 2, 3, 4, 5]

In [9]:
ser = pd.Series(values, index=label)
ser

a    1
b    2
c    3
d    4
e    5
dtype: int64

### Creating series using dictionaries

In [11]:
Marks = {"Maths": 43, "Science": 80, "English": 46, "Hindi": 95}
Ser = pd.Series(Marks)
Ser

Maths      43
Science    80
English    46
Hindi      95
dtype: int64

In [13]:
Ser2 = pd.Series(Marks, index = ["Maths", "Science", "English"])
Ser2

Maths      43
Science    80
English    46
dtype: int64

### Accessing a particular element (Series indexing)

In [14]:
Ser.iloc[1]  # by position; plain Ser[1] on a labelled index is deprecated in newer pandas

80

In [16]:
Ser["Science"]

80

### Slicing

In [17]:
Ser[1:3]

Science    80
English    46
dtype: int64

In [18]:
Ser[1:]

Science    80
English    46
Hindi      95
dtype: int64
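
The slices above are positional. A Series can also be sliced by label, and the two behave differently at the end of the range. A small sketch, assuming the same marks data under the hypothetical name `marks`:

```python
import pandas as pd

marks = pd.Series({"Maths": 43, "Science": 80, "English": 46, "Hindi": 95})

# Positional slice: the stop position is excluded, as with Python lists.
by_position = marks.iloc[1:3]

# Label slice: both end labels are included.
by_label = marks.loc["Science":"Hindi"]

print(by_position)
print(by_label)
```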

### Shape and Type

In [19]:
Ser.shape

(4,)

In [20]:
type(Ser)

pandas.core.series.Series

## DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns.
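
To make "heterogeneous" and "size-mutable" concrete, here is a short sketch. The `people` frame and the `Adult` column are invented for the illustration.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Haley", "Alex"], "Age": [16, 12]})

# Heterogeneous: each column keeps its own dtype.
print(people.dtypes)

# Size-mutable: assigning to a new label adds a column.
people["Adult"] = people["Age"] >= 18

print(people.shape)
```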

In [22]:
data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # avoid shadowing the built-in `list`
DF = pd.DataFrame(data)
DF

    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16


In [25]:
data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
columns = ["a", "b", "c", "d"]
labels = ["A", "B", "C", "D"]
DF = pd.DataFrame(data, index=labels, columns=columns)
DF

    a   b   c   d
A   1   2   3   4
B   5   6   7   8
C   9  10  11  12
D  13  14  15  16


### Creating a Data Frame using Dictionaries

In [87]:
DF = pd.DataFrame({"Name": ["Haley", "Alex", "Luke", "Manny"], "Age":[16, 12, 11, 11], "Branch":["Photography", "Science", "Trampolining", "Poetry"]})
DF

    Name  Age        Branch
0  Haley   16   Photography
1   Alex   12       Science
2   Luke   11  Trampolining
3  Manny   11        Poetry


### Shape and Type

In [88]:
type(DF)

pandas.core.frame.DataFrame

In [89]:
DF.shape

(4, 3)

### Accessing and modifying rows and columns

In [90]:
DF.columns

Index(['Name', 'Age', 'Branch'], dtype='object')

In [91]:
DF_new = DF.rename(columns={"Name": "Characters"})
DF_new

   Characters  Age        Branch
0       Haley   16   Photography
1        Alex   12       Science
2        Luke   11  Trampolining
3       Manny   11        Poetry


In [92]:
DF.columns[[0,1]]

Index(['Name', 'Age'], dtype='object')

In [93]:
DF.columns.values[[0,1]]

array(['Name', 'Age'], dtype=object)

In [94]:
DF.columns[2]

'Branch'

In [95]:
DF = DF.rename(columns={"Branch": "Interests"})  # rename via the API rather than mutating DF.columns.values
DF

    Name  Age     Interests
0  Haley   16   Photography
1   Alex   12       Science
2   Luke   11  Trampolining
3  Manny   11        Poetry


In [96]:
DF_new.drop("Age", axis = 1) #Axis = 1 -> along the columns

   Characters        Branch
0       Haley   Photography
1        Alex       Science
2        Luke  Trampolining
3       Manny        Poetry


In [97]:
DF_new

   Characters  Age        Branch
0       Haley   16   Photography
1        Alex   12       Science
2        Luke   11  Trampolining
3       Manny   11        Poetry


`drop` returns a new DataFrame and leaves the original unchanged, as the output above shows. To modify the DataFrame itself, pass `inplace=True`.

In [98]:
DF_new.drop("Age", axis = 1, inplace=True)
DF_new

   Characters        Branch
0       Haley   Photography
1        Alex       Science
2        Luke  Trampolining
3       Manny        Poetry


In [99]:
DF_new

   Characters        Branch
0       Haley   Photography
1        Alex       Science
2        Luke  Trampolining
3       Manny        Poetry
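
An alternative to `inplace=True` is to assign the returned DataFrame to a name. A minimal sketch, where the `students` frame is a small stand-in for `DF_new`:

```python
import pandas as pd

students = pd.DataFrame({
    "Characters": ["Haley", "Alex"],
    "Age": [16, 12],
    "Branch": ["Photography", "Science"],
})

# drop() returns a new DataFrame; the original keeps its "Age" column.
without_age = students.drop(columns="Age")

print(list(students.columns))
print(list(without_age.columns))
```

Keeping the original frame untouched makes it easier to rerun cells in any order.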


#### Concatenation

In [114]:
DF_2 = pd.DataFrame({"Name": ["Haley", "Alex", "Luke", "Manny"], "Age":[16, 12, 11, 11], "Branch":["Photography", "Science", "Trampolining", "Poetry"]})
DF_2

    Name  Age        Branch
0  Haley   16   Photography
1   Alex   12       Science
2   Luke   11  Trampolining
3  Manny   11        Poetry


In [115]:
DF_3 = pd.DataFrame({"Name": ["H", "A", "L", "M"], "Age":[16, 12, 11, 11], "Branch":["Photography", "Science", "Trampolining", "Poetry"], "College":["No", "Yes", "Yes", "Yes"]})
DF_3

   Name  Age        Branch  College
0     H   16   Photography       No
1     A   12       Science      Yes
2     L   11  Trampolining      Yes
3     M   11        Poetry      Yes


In [116]:
DF_3 = pd.concat([DF_2, DF_3])
DF_3

    Name  Age        Branch  College
0  Haley   16   Photography      NaN
1   Alex   12       Science      NaN
2   Luke   11  Trampolining      NaN
3  Manny   11        Poetry      NaN
0      H   16   Photography       No
1      A   12       Science      Yes
2      L   11  Trampolining      Yes
3      M   11        Poetry      Yes
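
Note that the concatenated result repeats the index labels 0 to 3. If a fresh index is wanted, `pd.concat` accepts `ignore_index=True`. A short sketch with two small hypothetical frames, `first` and `second`:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Haley", "Alex"], "Age": [16, 12]})
second = pd.DataFrame({"Name": ["H", "A"], "Age": [16, 12], "College": ["No", "Yes"]})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0, 1, 0, 1.
combined = pd.concat([first, second], ignore_index=True)

# Rows from `first` have no "College" value, so they are filled with NaN.
print(combined)
```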


## Handling Null/Missing values in Pandas

There are two ways to handle missing data:
1. Deleting a specific row or column: commonly used when a large share (say 70-75%) of the data in a row or column is Null. Make sure that removing the data does not affect the output or introduce bias.
2. Imputing the values: replaces the missing values with the mean, median, or mode of the variable.
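
One way to apply the deletion guideline above is to compute the fraction of Null values per column and keep only the columns below a cut-off. This is a sketch under assumptions: the `data` frame and the `threshold` value of 0.7 are chosen for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of Null values per column (mean of a boolean mask).
null_fraction = data.isnull().mean()

# Hypothetical cut-off taken from the 70-75% guideline.
threshold = 0.7
kept = data.loc[:, null_fraction < threshold]

print(null_fraction)
print(list(kept.columns))
```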

In [121]:
import numpy as np
df1=pd.DataFrame({'Name':['Haley',np.nan,'Alex',np.nan,'Luke','Luke'],'score1':[5,4,np.nan,np.nan,4,5],'score2':[np.nan,5,9,np.nan,4,5]})
df1

    Name  score1  score2
0  Haley     5.0     NaN
1    NaN     4.0     5.0
2   Alex     NaN     9.0
3    NaN     NaN     NaN
4   Luke     4.0     4.0
5   Luke     5.0     5.0


### Checking if we have Null values in our data.

In [122]:
df1.isnull()

    Name  score1  score2
0  False   False    True
1   True   False   False
2  False    True   False
3   True    True    True
4  False   False   False
5  False   False   False


In [124]:
df1.isnull().any()

Name      True
score1    True
score2    True
dtype: bool

For number of Null values:

In [125]:
df1.isnull().sum()

Name      2
score1    2
score2    2
dtype: int64

### Deleting

#### 1. Deleting the rows containing Null values:

In [126]:
df1.dropna()

   Name  score1  score2
4  Luke     4.0     4.0
5  Luke     5.0     5.0


In [127]:
df1

    Name  score1  score2
0  Haley     5.0     NaN
1    NaN     4.0     5.0
2   Alex     NaN     9.0
3    NaN     NaN     NaN
4   Luke     4.0     4.0
5   Luke     5.0     5.0


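To commit the changes, pass `inplace=True` to `dropna`. The call is commented out here, so the output below is still the original `df1`:

In [128]:
# df1.dropna(inplace=True)
df1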

    Name  score1  score2
0  Haley     5.0     NaN
1    NaN     4.0     5.0
2   Alex     NaN     9.0
3    NaN     NaN     NaN
4   Luke     4.0     4.0
5   Luke     5.0     5.0


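#### 2. Deleting the columns

Passing `axis=1` to `dropna`, as in `df1.dropna(axis=1)`, deletes every column that contains a Null value. Here that would delete the whole table, as all columns have Null values in them.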
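
The column-wise behaviour of `dropna` can be sketched as follows. The frame `df` is a made-up example, and `score3` is a hypothetical column with no Null values, added so that something survives the drop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Haley", np.nan, "Alex"],
    "score1": [5, 4, np.nan],
    "score3": [7, 8, 9],  # hypothetical column with no Null values
})

# axis=1 drops every column that has at least one Null value.
complete_columns = df.dropna(axis=1)

# how="all" drops a column only if every value in it is Null.
not_all_null = df.dropna(axis=1, how="all")

print(list(complete_columns.columns))
print(list(not_all_null.columns))
```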

### Using Mean, Median and Mode for Imputing the values

In [136]:
df1=pd.DataFrame({'Name':['Haley',np.nan,'Alex',np.nan,'Luke','Luke'],'score1':[5,4,np.nan,np.nan,4,5],'score2':[np.nan,5,9,np.nan,4,5]})
df1

    Name  score1  score2
0  Haley     5.0     NaN
1    NaN     4.0     5.0
2   Alex     NaN     9.0
3    NaN     NaN     NaN
4   Luke     4.0     4.0
5   Luke     5.0     5.0


Imputing the mean value (a common choice for numeric columns)

In [137]:
df1["score1"].mean()

4.5

In [138]:
df1["score1"] = df1["score1"].fillna(df1["score1"].mean())
df1

    Name  score1  score2
0  Haley     5.0     NaN
1    NaN     4.0     5.0
2   Alex     4.5     9.0
3    NaN     4.5     NaN
4   Luke     4.0     4.0
5   Luke     5.0     5.0


Imputing Median value

In [143]:
df1["score2"].median()

5.0

In [140]:
df1["score2"] = df1["score2"].fillna(df1["score2"].median())
df1

    Name  score1  score2
0  Haley     5.0     5.0
1    NaN     4.0     5.0
2   Alex     4.5     9.0
3    NaN     4.5     5.0
4   Luke     4.0     4.0
5   Luke     5.0     5.0


Imputing Mode value

In [144]:
df1["Name"].mode()

0    Luke
Name: Name, dtype: object

In [145]:
df1["Name"] = df1["Name"].fillna(df1["Name"].mode()[0])
df1

    Name  score1  score2
0  Haley     5.0     5.0
1   Luke     4.0     5.0
2   Alex     4.5     9.0
3   Luke     4.5     5.0
4   Luke     4.0     4.0
5   Luke     5.0     5.0


In [147]:
df1.Name.value_counts()

Luke     4
Haley    1
Alex     1
Name: Name, dtype: int64
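
The three imputations above can also be done in one step, since `DataFrame.fillna` accepts a dictionary that maps each column to its fill value. A minimal sketch using the same data under the hypothetical names `raw`, `fill_values` and `filled`:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Haley", np.nan, "Alex", np.nan, "Luke", "Luke"],
    "score1": [5, 4, np.nan, np.nan, 4, 5],
    "score2": [np.nan, 5, 9, np.nan, 4, 5],
})

# One fill value per column: mean, median and mode respectively.
fill_values = {
    "score1": raw["score1"].mean(),
    "score2": raw["score2"].median(),
    "Name": raw["Name"].mode()[0],
}
filled = raw.fillna(fill_values)  # returns a new frame; raw is unchanged

print(filled)
```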

### Grouping rows or columns

In [148]:
group = df1.groupby("Name")
group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001CCE0761E50>

In [149]:
group.mean()

       score1  score2
Name
Alex    4.500    9.00
Haley   5.000    5.00
Luke    4.375    4.75
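
A grouped object is not limited to `mean()`. As a sketch, several aggregations can be requested at once with `agg`; the `scores` frame below is a reduced, made-up version of the data.

```python
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Haley", "Luke", "Alex", "Luke"],
    "score1": [5.0, 4.0, 4.5, 5.0],
})

# Select the numeric column explicitly, then apply several aggregations.
summary = scores.groupby("Name")["score1"].agg(["mean", "count"])

print(summary)
```

Selecting the column before aggregating avoids errors when the frame also holds non-numeric columns.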


### Using loc and iloc to access a DataFrame

In [152]:
df1

    Name  score1  score2
0  Haley     5.0     5.0
1   Luke     4.0     5.0
2   Alex     4.5     9.0
3   Luke     4.5     5.0
4   Luke     4.0     4.0
5   Luke     5.0     5.0


In [155]:
df2 = df1.loc[0]
df2

Name      Haley
score1      5.0
score2      5.0
Name: 0, dtype: object

In [156]:
type(df2)

pandas.core.series.Series

In [157]:
df1.loc[0:2]

    Name  score1  score2
0  Haley     5.0     5.0
1   Luke     4.0     5.0
2   Alex     4.5     9.0
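
The cells above use only `loc`. Note that `df1.loc[0:2]` returns three rows because `loc` slices by label and includes the end label. The contrast with `iloc` can be sketched as follows, assuming a small frame with string labels so that labels and positions are easy to tell apart:

```python
import pandas as pd

df = pd.DataFrame(
    {"score1": [5.0, 4.0, 4.5], "score2": [5.0, 5.0, 9.0]},
    index=["a", "b", "c"],
)

# loc selects by label; a label slice includes the end label.
by_label = df.loc["a":"b", "score1"]

# iloc selects by integer position; the stop position is excluded.
by_position = df.iloc[0:2, 0]

# Both selections cover the same two rows here.
print(by_label.equals(by_position))
```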
