# <font color = 'green'>Pandas - I</font>
## Source:
 - Book: [McKinney | (p 98 / 529)]
 - Book: [Nelli | (p 69 / 576)]
 
## Topics:
 - [Part I: Pandas Basics](#Part-I:-Pandas-Basics)
 - [Part II: Pandas Series Objects](#Part-II:-Pandas-Series-Objects)
 - [Part III: Pandas DataFrame Objects](#Part-III:-Pandas-DataFrame-Objects)
 - [Part IV: Pandas DataFrame Properties](#Part-IV:-Pandas-DataFrame-Properties)
     - [4.1 DataFrame Properties](#4.1-DataFrame-Properties)
     - [4.2 DataFrame Properties II: Selection, Indexing and Methods](#4.2-DataFrame-Properties-II:-Selection,-Indexing-and-Methods)
     - [4.3 DataFrame Properties III: Removing Rows and Columns](#4.3-DataFrame-Properties-III:-Removing-Rows-and-Columns)
     - [4.4 DataFrame Properties IV: Conditional Selection](#4.4-DataFrame-Properties-IV:-Conditional-Selection)
     - [4.5 DataFrame Properties V: More Index Details](#4.5-DataFrame-Properties-V:-More-Index-Details)
     - [4.6 DataFrame Properties VI: Function Application and Mapping](#4.6-DataFrame-Properties-VI:-Function-Application-and-Mapping)
 - [Part V: Pandas DataFrame Properties: Data Cleaning](#Part-V:-Pandas-DataFrame-Properties:-Data-Cleaning)

# <font color = 'green'>Part I: Pandas Basics</font>

Pandas contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. pandas is often used in tandem with numerical
computing tools like NumPy and SciPy, analytical libraries like statsmodels and
scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant
parts of NumPy’s idiomatic style of array-based computing, especially array-based
functions and a preference for data processing without for loops.
While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for
every problem, they provide a solid, easy-to-use basis for most applications.

# <font color = 'green'>Part II: Pandas Series Objects</font>

## <font color = 'blue'>2.1 Series Objects</font>

A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data:


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import random

pd.plotting.register_matplotlib_converters()
%matplotlib inline
# Note: matplotlib 3.6 renamed this style to 'seaborn-v0_8'; use that name on newer versions.
plt.style.use('seaborn')
pd.set_option('display.max_columns', 500)
warnings.filterwarnings("ignore")

In [2]:
str1 = 'abcdefghij'.upper()
l1 = list(str1)  # one label per character

# The string representation of a Series displayed interactively shows the index on the left and the values on the right.
s1 = pd.Series(np.random.randint(30, 99, 10), index = l1)
s1

A    58
B    87
C    90
D    37
E    76
F    31
G    65
H    58
I    93
J    36
dtype: int32

In [3]:
# You can get the array representation and index object of the Series via its values and index attributes, respectively:

print(f's1 values: {s1.values}')
print(f's1 index: {s1.index}')

s1 values: [58 87 90 37 76 31 65 58 93 36]
s1 index: Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')


In [4]:
labels = ['a', 'b', 'c', 'd', 'e']
l = [x for x in range(1,6)]
values = ['apple', 'banana', 'cherry', 'guava', 'orange']
d = dict(zip(labels, values))
a = np.random.randint(1, 20, 5)

# Generating Series from the List 'l'
s1 = pd.Series(data = l)
print(f'series 1: \n{s1}')
print()

# Generating Series from the List 'l' with custom labels
s1 = pd.Series(data = l, index = labels)
print(f'series 1: \n{s1}')
print()

# Generating Series from the Dictionary 'd'
s2 = pd.Series(data = d)
print(f'series 2: \n{s2}')
print()

# Generating Series from the array 'a'
s3 = pd.Series(data = a, index = labels)
print(f'series 3: \n{s3}')
print()

# Generating Series from the array 'a' and labels
s4 = pd.Series(a, labels)
print(f'series 4: \n{s4}')

series 1: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

series 1: 
a    1
b    2
c    3
d    4
e    5
dtype: int64

series 2: 
a     apple
b    banana
c    cherry
d     guava
e    orange
dtype: object

series 3: 
a     7
b    19
c    16
d     3
e     7
dtype: int32

series 4: 
a     7
b    19
c    16
d     3
e     7
dtype: int32


###  <font color = 'green'>Differences between ndarrays and Series Objects</font>

There are some differences worth noting between ndarrays and Series objects. 
 - First, elements in NumPy arrays are accessed by their integer position, starting with zero for the first element. A pandas Series object is more flexible because you can define your own labeled index and use it to access elements. The labels can be letters instead of numbers, or numbers in descending rather than ascending order.
 - Second, aligning data from different Series by matching labels is simpler than with ndarrays, for example when dealing with missing values. If a label has no match during alignment, pandas returns NaN (Not a Number) for it so that the operation does not fail.
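
The two points above can be sketched with a small example. The names `arr`, `s_a` and `s_b` and their values are made up for illustration and do not come from the notebook's data.

```python
import numpy as np
import pandas as pd

# An ndarray is addressed by integer position only.
arr = np.array([10, 20, 30])
first_by_position = arr[0]

# A Series can carry any labels, here letters in descending order.
s_a = pd.Series([10, 20, 30], index=['c', 'b', 'a'])
value_at_a = s_a['a']  # looked up by label, not by position

# Arithmetic aligns on labels: 'c' exists only in s_a and 'd' only in s_b,
# so both of those labels end up as NaN in the result.
s_b = pd.Series([1, 2, 3], index=['a', 'b', 'd'])
total = s_a + s_b
print(total)
```

The matching labels `'a'` and `'b'` are added pairwise regardless of where they sit in each Series.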

## <font color = 'blue'>2.2 Series Properties: Index and Values of Series</font>

In [5]:
s = pd.Series(data = np.random.randint(20, 50, 6), index = ['a', 'b', 'c', 'd', 'e', 'f'])
print(f's: \n{s}')
print(f'Index: {s.index}')
print(f'Values: {s.values}')

s: 
a    46
b    40
c    48
d    28
e    45
f    41
dtype: int32
Index: Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
Values: [46 40 48 28 45 41]


In [6]:
s['b']

40

In [7]:
# Here ['b', 'c', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.
s[['b', 'c', 'd']] # List works, but not tuple.

b    40
c    48
d    28
dtype: int32

In [8]:
# Helper that converts a string into a list of upper-case characters:
def listify(string):
    return list(string.upper())
print(listify('abcd'))

['A', 'B', 'C', 'D']


In [9]:
# Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

a = np.random.randint(10, 40, 5)
print(f'a: {a}')
print()
print(f'a > 15: {a[a > 15]}')
print()
print(f'a*2: {a*2}')
print()

string = 'abcde'.upper()
labels = [string[i:i+1] for i in range(0, len(string), 1)]
s = pd.Series(a, index = labels)
print(f's: \n{s}')
print()
print(f's + 2: \n{s + 2}')

a: [34 30 13 16 38]

a > 15: [34 30 16 38]

a*2: [68 60 26 32 76]

s: 
A    34
B    30
C    13
D    16
E    38
dtype: int32

s + 2: 
A    36
B    32
C    15
D    18
E    40
dtype: int32


In [10]:
# Arithmetic between two Series aligns on index labels; labels present in only one Series produce NaN:
b = np.random.randint(10, 40, 5)
s2 = pd.Series(b, index = listify('cdefg'))
s + s2

A     NaN
B     NaN
C    50.0
D    38.0
E    69.0
F     NaN
G     NaN
dtype: float64
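
When NaN for unmatched labels is not what you want, the `add` method accepts a `fill_value`. The following is a minimal sketch with made-up Series (`left` and `right` are illustrative names, not part of the notebook):

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
right = pd.Series([10, 20, 30], index=['B', 'C', 'D'])

# The + operator leaves NaN wherever a label is missing on either side.
plain_sum = left + right

# add() with fill_value=0 treats the missing side as 0 instead.
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```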

In [11]:
# Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict.
# As with a dict, `in` tests the index labels (the keys), not the values:

32 in s

False
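
A short sketch of the difference between testing labels and testing values. The Series `scores` is a made-up example, not the notebook's `s`:

```python
import pandas as pd

scores = pd.Series([34, 30, 13], index=['A', 'B', 'C'])

label_found = 'A' in scores                  # 'A' is an index label
value_as_label = 34 in scores                # 34 is a value, not a label
value_found = 34 in scores.values            # search the underlying array instead
value_found_isin = scores.isin([34]).any()   # the same check with a pandas method

print(label_found, value_as_label, value_found, value_found_isin)
```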

In [12]:
students = ['John', 'Mary', 'Rose', 'Harry', 'Alfred', 'Rick', 'Diana']
marks = np.random.randint(50, 80, 7)
d = dict(zip(students, marks))

print(f'marks: {marks}')
print()

s1 = pd.Series(d)
print(s1)
print()

s2 = pd.Series(d, index = ['John', 'Mary', 'Rose', 'Mark', 'David',])
s2

marks: [71 52 55 74 59 57 69]

John      71
Mary      52
Rose      55
Harry     74
Alfred    59
Rick      57
Diana     69
dtype: int64



John     71.0
Mary     52.0
Rose     55.0
Mark      NaN
David     NaN
dtype: float64

In [13]:
s2.isnull()

John     False
Mary     False
Rose     False
Mark      True
David     True
dtype: bool

In [14]:
students = ['John', 'Mary', 'Rose', 'Harry', 'Alfred', 'Rick', 'Diana']
marks = np.random.randint(50, 80, 7)
d = dict(zip(students, marks))
s = pd.Series(d)

print(f'marks: {marks}')
print()

s.name = 'Marks'
s.index.name = 'Students'
s

# Series index re-assignment (this replaces the whole index, so the index name 'Students' set above is lost):
s.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(s)

marks: [58 79 55 50 79 53 60]

a    58
b    79
c    55
d    50
e    79
f    53
g    60
Name: Marks, dtype: int64


# <font color = 'green'>Part III: Pandas DataFrame Objects</font>
## <font color = 'blue'>3.1 DataFrame Objects</font>

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). <font color = 'blue'>**The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index.**</font> Under the hood, the data is stored as one
or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays.

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:

In [15]:
# Data-frame constructor:
data = { 
        'Name': ['John', 'Mary', 'Rose', 'Harry', 'Alfred', 'Rick', 'Diana'],
        'Math': np.random.randint(50, 90, 7),
        'Physics': np.random.randint(50, 90, 7),
        'Chemistry': np.random.randint(50, 90, 7),
        'English': np.random.randint(50, 90, 7),
        'History': np.random.randint(50, 90, 7), 
        'Geography': np.random.randint(50, 90, 7),
        }

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography
0,John,63,81,77,77,58,62
1,Mary,59,82,66,58,57,73
2,Rose,61,53,73,80,87,61
3,Harry,73,62,54,81,64,69
4,Alfred,51,50,67,75,86,72
5,Rick,65,59,61,70,77,83
6,Diana,80,83,52,60,61,78
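
The description of a DataFrame as a dict of Series sharing one index can be made concrete with a small sketch. The Series `math` and `physics` below are hypothetical, and their labels deliberately overlap only in part:

```python
import pandas as pd

math = pd.Series({'John': 63, 'Mary': 59, 'Rose': 61})
physics = pd.Series({'Mary': 82, 'Rose': 53, 'Harry': 62})

# The row index is the union of the Series labels;
# entries missing from one Series become NaN in that column.
frame = pd.DataFrame({'Math': math, 'Physics': physics})
print(frame)
```

Each column of `frame` is again a Series, and all of them share the combined row index.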


In [16]:
# Creating a Student - Marks Dataframe
import numpy as np
import pandas as pd

data = { 
        'Name': ['John', 'Mary', 'Rose', 'Harry', 'Alfred', 'Rick', 'Diana'],
        'Math': np.random.randint(50, 90, 7),
        'Physics': np.random.randint(50, 90, 7),
        'Chemistry': np.random.randint(50, 90, 7),
        'English': np.random.randint(50, 90, 7),
        'History': np.random.randint(50, 90, 7), 
        'Geography': np.random.randint(50, 90, 7),
        }

# Use the number of rows for the index; len(data) would count the dict keys (columns) instead.
df = pd.DataFrame(data, index = range(1, len(data['Name']) + 1))
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography
1,John,88,57,84,64,60,78
2,Mary,69,63,63,77,79,79
3,Rose,70,65,55,53,56,68
4,Harry,70,59,84,83,53,56
5,Alfred,89,80,86,66,78,69
6,Rick,83,73,84,77,52,65
7,Diana,56,75,51,75,54,56


# <font color = 'green'>Part IV: Pandas DataFrame Properties</font>
## <font color = 'blue'>4.1 DataFrame Properties</font>

In [17]:
# Pass a list of column names in any order necessary

df[['Name', 'English', 'Math']]

Unnamed: 0,Name,English,Math
1,John,64,88
2,Mary,77,69
3,Rose,53,70
4,Harry,83,70
5,Alfred,66,89
6,Rick,77,83
7,Diana,75,56


In [18]:
# Each DataFrame column is a Series.
type(df['English'])

pandas.core.series.Series

In [19]:
# If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

pd.DataFrame(data, columns = ['English', 'Math'])

Unnamed: 0,English,Math
0,64,88
1,77,69
2,53,70
3,83,70
4,66,89
5,77,83
6,75,56


In [20]:
# A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

df['Name']

1      John
2      Mary
3      Rose
4     Harry
5    Alfred
6      Rick
7     Diana
Name: Name, dtype: object

In [21]:
# Attribute-like access (e.g., df.Name) and tab completion of column names in IPython are provided as a convenience.
# df[column] works for any column name, but df.column only works when the column name is a valid Python variable name.

df.Name

# Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.

1      John
2      Mary
3      Rose
4     Harry
5    Alfred
6      Rick
7     Diana
Name: Name, dtype: object
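
A minimal sketch of where attribute access stops working. The frame `grades` and its column names are invented for illustration:

```python
import pandas as pd

grades = pd.DataFrame({'Name': ['John', 'Mary'],
                       'Final Score': [71, 58],
                       'count': [3, 4]})

names = grades.Name              # works: 'Name' is a valid identifier
final = grades['Final Score']    # the space rules out attribute access
counts = grades['count']         # grades.count is the DataFrame method, not this column
count_is_method = callable(grades.count)

print(count_is_method)
```

Bracket notation is the safer default; attribute access is best kept for quick interactive use.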

In [22]:
# Rows and columns can be selected by label with loc or by integer position with iloc (much more on this later).
# Here iloc selects all rows and the columns at positions 2 and 3:

data = { 
        'Name': ['John', 'Mary', 'Rose', 'Harry', 'Alfred', 'Rick', 'Diana'],
        'Math': np.random.randint(50, 90, 7),
        'Physics': np.random.randint(50, 90, 7),
        'Chemistry': np.random.randint(50, 90, 7),
        'English': np.random.randint(50, 90, 7),
        'History': np.random.randint(50, 90, 7), 
        'Geography': np.random.randint(50, 90, 7),
        }

# Use the number of rows for the index; len(data) would count the dict keys (columns) instead.
df = pd.DataFrame(data, index = range(1, len(data['Name']) + 1))
df.iloc[:, 2:4]

Unnamed: 0,Physics,Chemistry
1,65,78
2,76,83
3,82,51
4,87,81
5,86,65
6,66,62
7,56,59


## <font color = 'blue'>4.2 DataFrame Properties II: Selection, Indexing and Methods</font>

In [23]:
import numpy as np
import pandas as pd

data = { 
        'Name': ['John', 'Mary', 'Rose', 'Harry', 'Alfred', 'Rick', 'Diana'],
        'Math': np.random.randint(50, 90, 7),
        'Physics': np.random.randint(50, 90, 7),
        'Chemistry': np.random.randint(50, 90, 7),
        'English': np.random.randint(50, 90, 7),
        'History': np.random.randint(50, 90, 7), 
        'Geography': np.random.randint(50, 90, 7),
        }

# Use the number of rows for the index; len(data) would count the dict keys (columns) instead.
df = pd.DataFrame(data, index = range(1, len(data['Name']) + 1))
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography
1,John,71,54,71,69,50,66
2,Mary,58,86,66,66,70,73
3,Rose,64,74,89,81,70,59
4,Harry,59,60,52,84,71,50
5,Alfred,65,87,69,72,58,74
6,Rick,79,70,62,63,76,64
7,Diana,50,72,70,88,73,60


In [24]:
# Pass a list of column names in any order necessary
df[['English', 'Math', 'Physics']]

Unnamed: 0,English,Math,Physics
1,69,71,54
2,66,58,86
3,81,64,74
4,84,59,60
5,72,65,87
6,63,79,70
7,88,50,72


### Reading files: the <code>read_csv()</code> function:

In [25]:
# Python program to convert .txt file into a .csv file
import csv

with open('entities.txt', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('entities.csv', 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerows(lines)

In [26]:
# Using pandas to load a .txt file into a DataFrame

import pandas as pd

# The file has no header row, so the column names are supplied explicitly
col_names = ['Entity', 'Name'] + [f'Syn{i}' for i in range(1, 13)]
df2 = pd.read_csv('entities.txt', delimiter=",", names = col_names, header = None)
df2.to_csv('watson_entities.csv', index = False)
df2[['Entity', 'Name', 'Syn1', 'Syn2', 'Syn3']].head()

Unnamed: 0,Entity,Name,Syn1,Syn2,Syn3
0,occasion,Christmas,Yule,December 25th,Dec 25
1,occasion,Graduation,commencement,graduate,grad
2,occasion,Wedding,bridesmaid,groom,bride
3,occasion,Congratulations,success,award,
4,occasion,Mother's Day,Mom's Day,Mum's Day,Mothering Sunday
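
Because `entities.txt` is not included here, the same `read_csv` options can be tried on an in-memory stand-in. The text in `raw` below is a hypothetical two-line sample, not the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless, comma-separated text file.
raw = io.StringIO(
    "occasion,Christmas,Yule,Dec 25\n"
    "occasion,Wedding,bride\n"
)

# header=None: the first line is data; names supplies the column labels.
# Rows with fewer fields than names are padded with NaN.
entities = pd.read_csv(raw, header=None,
                       names=['Entity', 'Name', 'Syn1', 'Syn2'])
print(entities)
```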


### Calculating Mean: the <code>mean()</code> method:

In [27]:
# df = df.assign(Average=df.mean(axis=1))
# df.drop('Average', axis=1,inplace=True)

# Adding a column to DataFrame:
df['CSc'] = np.random.randint(50, 90, 7)
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography,CSc
1,John,71,54,71,69,50,66,61
2,Mary,58,86,66,66,70,73,74
3,Rose,64,74,89,81,70,59,62
4,Harry,59,60,52,84,71,50,57
5,Alfred,65,87,69,72,58,74,80
6,Rick,79,70,62,63,76,64,51
7,Diana,50,72,70,88,73,60,57


In [28]:
# Re-ordering columns:
df = df[['Name', 'Math', 'Physics', 'Chemistry', 'CSc', 'English', 'History', 'Geography']]
df

Unnamed: 0,Name,Math,Physics,Chemistry,CSc,English,History,Geography
1,John,71,54,71,61,69,50,66
2,Mary,58,86,66,74,66,70,73
3,Rose,64,74,89,62,81,70,59
4,Harry,59,60,52,57,84,71,50
5,Alfred,65,87,69,80,72,58,74
6,Rick,79,70,62,51,63,76,64
7,Diana,50,72,70,57,88,73,60


In [29]:
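# Adding a column: the row-wise average of the numeric columns.
# numeric_only=True skips the 'Name' column; recent pandas versions raise an error without it.
df = df.assign(Average=df.mean(axis=1, numeric_only=True))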
df

Unnamed: 0,Name,Math,Physics,Chemistry,CSc,English,History,Geography,Average
1,John,71,54,71,61,69,50,66,63.142857
2,Mary,58,86,66,74,66,70,73,70.428571
3,Rose,64,74,89,62,81,70,59,71.285714
4,Harry,59,60,52,57,84,71,50,61.857143
5,Alfred,65,87,69,80,72,58,74,72.142857
6,Rick,79,70,62,51,63,76,64,66.428571
7,Diana,50,72,70,57,88,73,60,67.142857
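
A row average over a frame that also holds a text column needs the non-numeric column excluded in some way. Two options are sketched below on a made-up frame (`report` is an illustrative name):

```python
import pandas as pd

report = pd.DataFrame({'Name': ['John', 'Mary'],
                       'Math': [70, 58],
                       'Physics': [54, 86]})

# Option 1: let pandas skip the non-numeric columns.
avg_numeric_only = report.mean(axis=1, numeric_only=True)

# Option 2: name the columns to average explicitly.
avg_explicit = report[['Math', 'Physics']].mean(axis=1)

print(avg_numeric_only)
```

Naming the columns explicitly also protects against a later-added column such as `Average` being included in its own recomputation.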


### Rounding Numbers: the <code>round()</code> method:

In [30]:
# Rounding off values: the round() method:
df = df.round(2)
df

Unnamed: 0,Name,Math,Physics,Chemistry,CSc,English,History,Geography,Average
1,John,71,54,71,61,69,50,66,63.14
2,Mary,58,86,66,74,66,70,73,70.43
3,Rose,64,74,89,62,81,70,59,71.29
4,Harry,59,60,52,57,84,71,50,61.86
5,Alfred,65,87,69,80,72,58,74,72.14
6,Rick,79,70,62,51,63,76,64,66.43
7,Diana,50,72,70,57,88,73,60,67.14


In [31]:
# More Examples of the round method: Rounding columns individually

# setting the seed to re-create the dataframe 
np.random.seed(25)

# Creating a 5 * 4 dataframe  
df2 = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])

# Print the dataframe 
df2

Unnamed: 0,A,B,C,D
0,0.870124,0.582277,0.278839,0.185911
1,0.4111,0.117376,0.684969,0.437611
2,0.556229,0.36708,0.402366,0.113041
3,0.447031,0.585445,0.161985,0.520719
4,0.326051,0.699186,0.366395,0.836375


In [32]:
# round off the columns in this manner 
# "A" to 1 decimal place 
# "B" to 2 decimal place 
# "C" to 3 decimal place 
# "D" to 4 decimal place 
  
df2 = df2.round({"A":1, "B":2, "C":3, "D":4}) 
df2

Unnamed: 0,A,B,C,D
0,0.9,0.58,0.279,0.1859
1,0.4,0.12,0.685,0.4376
2,0.6,0.37,0.402,0.113
3,0.4,0.59,0.162,0.5207
4,0.3,0.7,0.366,0.8364


## <font color = 'blue'>4.3 DataFrame Properties III: Removing Rows and Columns</font>

In [33]:
df

Unnamed: 0,Name,Math,Physics,Chemistry,CSc,English,History,Geography,Average
1,John,71,54,71,61,69,50,66,63.14
2,Mary,58,86,66,74,66,70,73,70.43
3,Rose,64,74,89,62,81,70,59,71.29
4,Harry,59,60,52,57,84,71,50,61.86
5,Alfred,65,87,69,80,72,58,74,72.14
6,Rick,79,70,62,51,63,76,64,66.43
7,Diana,50,72,70,57,88,73,60,67.14


In [34]:
# Removing columns:
# Use axis=0 to drop rows and axis=1 to drop columns.
# drop() returns a new DataFrame; the original is modified only when inplace=True.

df['CSc'] = np.random.randint(50, 100, 7)
df['Sports'] = np.random.randint(50, 100, 7)
df = df[['Name', 'Math', 'Physics', 'Chemistry', 'CSc', 'English', 'History', 'Geography', 'Sports', 'Average']]
# Note that Average was assigned earlier and is not recomputed here.
df

Unnamed: 0,Name,Math,Physics,Chemistry,CSc,English,History,Geography,Sports,Average
1,John,71,54,71,59,69,50,66,77,63.14
2,Mary,58,86,66,82,66,70,73,99,70.43
3,Rose,64,74,89,60,81,70,59,68,71.29
4,Harry,59,60,52,97,84,71,50,78,61.86
5,Alfred,65,87,69,92,72,58,74,86,72.14
6,Rick,79,70,62,51,63,76,64,84,66.43
7,Diana,50,72,70,72,88,73,60,80,67.14


In [35]:
df.drop(['CSc', 'Sports'], axis = 1, inplace = True)
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography,Average
1,John,71,54,71,69,50,66,63.14
2,Mary,58,86,66,66,70,73,70.43
3,Rose,64,74,89,81,70,59,71.29
4,Harry,59,60,52,84,71,50,61.86
5,Alfred,65,87,69,72,58,74,72.14
6,Rick,79,70,62,63,76,64,66.43
7,Diana,50,72,70,88,73,60,67.14


In [36]:
# Dropping rows with axis = 0
df.drop([6], axis = 0, inplace = True)
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography,Average
1,John,71,54,71,69,50,66,63.14
2,Mary,58,86,66,66,70,73,70.43
3,Rose,64,74,89,81,70,59,71.29
4,Harry,59,60,52,84,71,50,61.86
5,Alfred,65,87,69,72,58,74,72.14
7,Diana,50,72,70,88,73,60,67.14
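
As an alternative to `axis` and `inplace`, `drop` also accepts the `columns=` and `index=` keywords and returns a new DataFrame. A minimal sketch on an invented frame (`table` is an illustrative name):

```python
import pandas as pd

table = pd.DataFrame({'Math': [71, 58, 64], 'CSc': [59, 82, 60]},
                     index=[1, 2, 3])

without_csc = table.drop(columns=['CSc'])   # same as drop(['CSc'], axis=1)
without_row2 = table.drop(index=[2])        # same as drop([2], axis=0)

# table itself is unchanged because inplace was not requested.
print(table.shape, without_csc.shape, without_row2.shape)
```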


### Selecting Rows
#### Source:
 - [Pandas Doc | df.loc[..]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
 - [Pandas Doc | Label based Indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label)
 
DataFrame.loc
    Access a group of rows and columns by label(s) or a boolean array.

    .loc[] is primarily label based, but may also be used with a boolean array.

Allowed inputs are:

 - A single label, e.g. <code>5</code> or <code>'a'</code> (note that 5 is interpreted as a label of the index, never as an integer position along the index).
 - A list or array of labels, e.g. <code>['a', 'b', 'c']</code>.
 - A slice object with labels, e.g. <code>'a':'f'</code>.

Warning: Contrary to usual Python slices, both the start and the stop are included.

 - A boolean array of the same length as the axis being sliced, e.g. <code>[True, False, True]</code>.
 - A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
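
Each of the allowed inputs listed above can be tried on a small frame. This is a sketch with made-up data (`frame` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({'Math': [71, 58, 64, 59]},
                     index=['a', 'b', 'c', 'd'])

single = frame.loc['b']                            # single label: one row as a Series
several = frame.loc[['a', 'c']]                    # list of labels
sliced = frame.loc['a':'c']                        # label slice, 'c' is included
masked = frame.loc[[True, False, True, False]]     # boolean list, one flag per row
by_callable = frame.loc[lambda d: d['Math'] > 60]  # callable returning a boolean Series

print(sliced)
```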

In [37]:
# Select a row by its index label
df.loc[2]

Name          Mary
Math            58
Physics         86
Chemistry       66
English         66
History         70
Geography       73
Average      70.43
Name: 2, dtype: object

In [38]:
# iloc selects by integer position (0-based) instead of by label
df.iloc[1]

Name          Mary
Math            58
Physics         86
Chemistry       66
English         66
History         70
Geography       73
Average      70.43
Name: 2, dtype: object

In [39]:
df.loc[:, ['Name', 'Math', 'Physics', 'Chemistry']]

Unnamed: 0,Name,Math,Physics,Chemistry
1,John,71,54,71
2,Mary,58,86,66
3,Rose,64,74,89
4,Harry,59,60,52
5,Alfred,65,87,69
7,Diana,50,72,70


In [40]:
df.loc[1:3]

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography,Average
1,John,71,54,71,69,50,66,63.14
2,Mary,58,86,66,66,70,73,70.43
3,Rose,64,74,89,81,70,59,71.29
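
The same slice `1:3` means different things to `loc` and `iloc` when the index starts at 1. A minimal sketch on an invented frame:

```python
import pandas as pd

frame = pd.DataFrame({'Name': ['John', 'Mary', 'Rose', 'Harry']},
                     index=[1, 2, 3, 4])

by_label = frame.loc[1:3]      # labels 1, 2 and 3 (stop label included)
by_position = frame.iloc[1:3]  # positions 1 and 2 (stop position excluded)

print(by_label['Name'].tolist())
print(by_position['Name'].tolist())
```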


In [41]:
df.loc[[1, 2, 3], ['Name', 'Math', 'Physics', 'Chemistry']] # 1, 2, 3 are index labels, not integer positions (see the sources above).

Unnamed: 0,Name,Math,Physics,Chemistry
1,John,71,54,71
2,Mary,58,86,66
3,Rose,64,74,89


In [42]:
# Adding a row.
df.loc[6] = ['Richard', 87, 58, 59, 48, 58, 91, 82.50]
df = df.sort_index()
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography,Average
1,John,71,54,71,69,50,66,63.14
2,Mary,58,86,66,66,70,73,70.43
3,Rose,64,74,89,81,70,59,71.29
4,Harry,59,60,52,84,71,50,61.86
5,Alfred,65,87,69,72,58,74,72.14
6,Richard,87,58,59,48,58,91,82.5
7,Diana,50,72,70,88,73,60,67.14


In [43]:
# Selecting a subset of rows and columns.
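# Syntax: df.loc[row_selector, column_selector]. The comma separates rows from columns,
# and a colon builds a label slice whose start and stop are both included.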
df2 = df.loc[2:4, 'Math':'Average']
df2

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average
2,58,86,66,66,70,73,70.43
3,64,74,89,81,70,59,71.29
4,59,60,52,84,71,50,61.86


## <font color = 'blue'>4.4 DataFrame Properties IV: Conditional Selection</font>

In [44]:
df2 > 60

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average
2,False,True,True,True,True,True,True
3,True,True,True,True,True,False,True
4,False,False,False,True,True,False,True


In [45]:
df2[df2 > 60]

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average
2,,86.0,66.0,66,70,73.0,70.43
3,64.0,74.0,89.0,81,70,,71.29
4,,,,84,71,,61.86


In [46]:
df2[df2['Math'] > 70]

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average


In [47]:
# dataframe[dataframe['Score1']>0.5][['Score2','Score3']]
df2 = df.loc[:, 'Math':'Average']
df2

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average
1,71,54,71,69,50,66,63.14
2,58,86,66,66,70,73,70.43
3,64,74,89,81,70,59,71.29
4,59,60,52,84,71,50,61.86
5,65,87,69,72,58,74,72.14
6,87,58,59,48,58,91,82.5
7,50,72,70,88,73,60,67.14


In [48]:
df2[df2['English'] > 60][['Math', 'Physics', 'Chemistry']] # Format: df2[row condition][list of columns]

Unnamed: 0,Math,Physics,Chemistry
1,71,54,71
2,58,86,66
3,64,74,89
4,59,60,52
5,65,87,69
7,50,72,70


In [49]:
# For multiple conditions, use | (OR) and & (AND), with each condition wrapped in parentheses
df2[(df2['Math'] > 70) & (df2['English'] > 60)] # AND

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average
1,71,54,71,69,50,66,63.14


In [50]:
df2[(df2['Math'] > 70) | (df2['English'] > 60)] # OR

Unnamed: 0,Math,Physics,Chemistry,English,History,Geography,Average
1,71,54,71,69,50,66,63.14
2,58,86,66,66,70,73,70.43
3,64,74,89,81,70,59,71.29
4,59,60,52,84,71,50,61.86
5,65,87,69,72,58,74,72.14
6,87,58,59,48,58,91,82.5
7,50,72,70,88,73,60,67.14


## <font color = 'blue'>4.5 DataFrame Properties V: More Index Details</font>
Some more features of indexing include:
  - resetting the index
  - setting a different column as the index
  - index hierarchy

In [51]:
df

Unnamed: 0,Name,Math,Physics,Chemistry,English,History,Geography,Average
1,John,71,54,71,69,50,66,63.14
2,Mary,58,86,66,66,70,73,70.43
3,Rose,64,74,89,81,70,59,71.29
4,Harry,59,60,52,84,71,50,61.86
5,Alfred,65,87,69,72,58,74,72.14
6,Richard,87,58,59,48,58,91,82.5
7,Diana,50,72,70,88,73,60,67.14


In [52]:
# reset_index() returns a new DataFrame with the default 0-based index; the old index (1 to 7) is kept as a column named 'index'
df.reset_index()

Unnamed: 0,index,Name,Math,Physics,Chemistry,English,History,Geography,Average
0,1,John,71,54,71,69,50,66,63.14
1,2,Mary,58,86,66,66,70,73,70.43
2,3,Rose,64,74,89,81,70,59,71.29
3,4,Harry,59,60,52,84,71,50,61.86
4,5,Alfred,65,87,69,72,58,74,72.14
5,6,Richard,87,58,59,48,58,91,82.5
6,7,Diana,50,72,70,88,73,60,67.14
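
If the old index is not needed as a column, `reset_index` takes `drop=True`. A minimal sketch on an invented frame:

```python
import pandas as pd

frame = pd.DataFrame({'Name': ['John', 'Mary']}, index=[1, 2])

kept = frame.reset_index()              # old index becomes a column named 'index'
dropped = frame.reset_index(drop=True)  # old index is discarded

# frame keeps its original index because the results were assigned to new names.
print(list(kept.columns), list(dropped.columns), list(frame.index))
```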


In [56]:
def listify(string):
    # Split a string into a list of its upper-cased characters
    return list(string.upper())

# reset_index() returns a new DataFrame; the result is not assigned here, so df is unchanged
df.reset_index()
new_index = listify('abcdefg')
df['ID'] = new_index

# DataFrame's set_index function returns a new DataFrame using one or more of its columns as the index (df itself is not modified):
df.set_index('ID')

ID,Name,Math,Physics,Chemistry,English,History,Geography,Average
A,John,71,54,71,69,50,66,63.14
B,Mary,58,86,66,66,70,73,70.43
C,Rose,64,74,89,81,70,59,71.29
D,Harry,59,60,52,84,71,50,61.86
E,Alfred,65,87,69,72,58,74,72.14
F,Richard,87,58,59,48,58,91,82.5
G,Diana,50,72,70,88,73,60,67.14
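
Passing more than one column to `set_index` produces a hierarchical index. The frame below is a made-up example (the names `frame` and `hier` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'Year': [2016, 2016, 2017],
                      'Subject': ['Math', 'Physics', 'Math'],
                      'Marks': [62, 65, 63]})

# Two columns become a two-level index; frame itself is left unchanged.
hier = frame.set_index(['Year', 'Subject'])

marks_2016 = hier.loc[2016]                      # all rows under the outer label 2016
one_cell = hier.loc[(2016, 'Physics'), 'Marks']  # a full (Year, Subject) key

print(hier)
```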


## <font color = 'blue'>4.6 DataFrame Properties VI: Function Application and Mapping</font>
Another frequent operation is applying a function on one-dimensional arrays to each
column or row. DataFrame’s <code>apply</code> method does exactly this:

In [57]:
# Create a DataFrame:
def df_maker(m, n):
    index = pd.Index([f'Row {i + 1}' for i in range(m)], name='Rows')
    columns = pd.Index([f'Col {i + 1}' for i in range(n)], name='Columns')
    df = pd.DataFrame(np.random.randint(1, 100, size=(m, n)), index=index, columns=columns)
    return df

df = df_maker(5, 5)
df

Columns,Col 1,Col 2,Col 3,Col 4,Col 5
Rows,,,,,
Row 1,39,3,6,58,74
Row 2,23,12,3,44,91
Row 3,62,85,84,75,18
Row 4,15,84,38,88,51
Row 5,52,15,72,63,91


In [58]:
f = lambda x: x.max() - x.mean()
df.apply(f)

Columns
Col 1    23.8
Col 2    45.2
Col 3    43.4
Col 4    22.4
Col 5    26.0
dtype: float64

In [59]:
df.apply(f, axis = 1)

Rows
Row 1    38.0
Row 2    56.4
Row 3    20.2
Row 4    32.8
Row 5    32.4
dtype: float64
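
For a function built from existing reductions, `apply` can be cross-checked against the vectorized methods. The sketch below uses a seeded generator and illustrative names; it assumes nothing about the random values shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.integers(1, 100, size=(5, 3)),
                     columns=['Col 1', 'Col 2', 'Col 3'])

def spread(x):
    # Distance between the largest value and the mean of a 1-D Series
    return x.max() - x.mean()

via_apply_cols = frame.apply(spread)                       # one result per column
via_methods_cols = frame.max() - frame.mean()              # same result, vectorized

via_apply_rows = frame.apply(spread, axis=1)               # one result per row
via_methods_rows = frame.max(axis=1) - frame.mean(axis=1)

print(via_apply_cols)
```

`axis=0` (the default) hands each column to the function, while `axis=1` hands it each row.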

In [60]:
# apply also works on a DataFrame whose columns form a MultiIndex:
columns = pd.MultiIndex.from_arrays(
    [['Student 1', 'Student 1', 'Student 1', 'Student 2', 'Student 2', 'Student 2'],
     ['Physics', 'Chemistry', 'Math', 'Physics', 'Chemistry', 'Math']],
    names=['Student', 'Subject'])
index = pd.Index([2016, 2017, 2018, 2019], name = 'Year')
marks = pd.DataFrame(np.random.randint(50, 100, 24).reshape(4, 6), index = index, columns = columns)
marks 

Student,Student 1,Student 1,Student 1,Student 2,Student 2,Student 2
Subject,Physics,Chemistry,Math,Physics,Chemistry,Math
Year,,,,,,
2016,65,64,62,89,74,96
2017,87,79,63,96,53,52
2018,94,94,95,74,59,53
2019,65,82,81,55,86,69


In [61]:
marks.apply(f)

Student    Subject  
Student 1  Physics      16.25
           Chemistry    14.25
           Math         19.75
Student 2  Physics      17.50
           Chemistry    18.00
           Math         28.50
dtype: float64

In [63]:
marks.apply(f, axis = 1).round(2)

Year
2016    21.00
2017    24.33
2018    16.83
2019    13.00
dtype: float64

In [65]:
f2 = lambda x: x.mean()
df.apply(f2).round(2)

Columns
Col 1    38.2
Col 2    39.8
Col 3    40.6
Col 4    65.6
Col 5    65.0
dtype: float64

In [67]:
# The function passed to apply need not return a scalar value; it can also return a Series with multiple values:
f3 = lambda x: pd.Series([x.min(), x.max()], index=['Min', 'Max'])

# We could also write the f3 function in long form as:
# def f3(x):
#     return pd.Series([x.min(), x.max()], index=['Min', 'Max'])

marks.apply(f3)

Student,Student 1,Student 1,Student 1,Student 2,Student 2,Student 2
Subject,Physics,Chemistry,Math,Physics,Chemistry,Math
Min,65,64,62,55,53,52
Max,94,94,95,96,86,96


In [68]:
marks.apply(f3, axis = 1)

Year,Min,Max
2016,62,96
2017,52,96
2018,53,95
2019,55,86
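
When the function only combines built-in reductions, `agg` with a list of names gives the same table without a custom function. A sketch on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'Physics': [65, 87, 94], 'Math': [62, 63, 95]},
                     index=[2016, 2017, 2018])

# apply with a function that returns a Series of two values per column
via_apply = frame.apply(
    lambda x: pd.Series([x.min(), x.max()], index=['min', 'max']))

# agg with a list of reduction names builds the same rows
via_agg = frame.agg(['min', 'max'])

print(via_agg)
```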


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each value in <code>marks</code>. You can do this with <code>applymap</code> (pandas 2.1 deprecates <code>applymap</code> in favor of the equivalent <code>DataFrame.map</code>):

In [69]:
# Named fmt rather than format to avoid shadowing the built-in format()
fmt = lambda x: '%.2f' % x
marks.applymap(fmt)

Student,Student 1,Student 1,Student 1,Student 2,Student 2,Student 2
Subject,Physics,Chemistry,Math,Physics,Chemistry,Math
Year,,,,,,
2016,65.00,64.00,62.00,89.00,74.00,96.00
2017,87.00,79.00,63.00,96.00,53.00,52.00
2018,94.00,94.00,95.00,74.00,59.00,53.00
2019,65.00,82.00,81.00,55.00,86.00,69.00
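
A version-tolerant sketch of element-wise formatting. It assumes `DataFrame.map` is available from pandas 2.1 and falls back to `applymap` otherwise; the frame and the helper `two_decimals` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'Physics': [65, 87], 'Math': [62, 63]},
                     index=[2016, 2017])

def two_decimals(x):
    # Format one value as a string with two decimal places
    return '%.2f' % x

# Use DataFrame.map where it exists, otherwise the older applymap.
elementwise = frame.map if hasattr(pd.DataFrame, 'map') else frame.applymap
formatted = elementwise(two_decimals)

# Series.map is the element-wise counterpart for a single column.
physics_formatted = frame['Physics'].map(two_decimals)

print(formatted)
```

The result holds strings, not numbers, so it is suited to display rather than further arithmetic.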
