# **Lecture 7A**
# **Summarizing data in DataFrame**


In this part, we will learn how to summarize data in a DataFrame. We will reuse the Excel file **student2.xlsx** from Lecture 6. Make sure that you have executed the following 2 cells before you get started with the examples.

In [1]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# import Pandas module
import pandas as pd

# Read XLSX file into DataFrame
# We are reading the sheet named "sheet1" from the file student2.xlsx
datadf = pd.read_excel("/content/drive/MyDrive/Data/student2.xlsx",sheet_name="sheet1")
display(datadf)
print(datadf[["Math","English","Chinese"]])

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True
5,6,Jerry,Li,M,54,99,37,Ping Pong/Swim/Cycling,3.32,True,False
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False
8,9,Clara,Yau,F,51,65,45,Football/Music/Art,3.02,True,True
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False


   Math  English  Chinese
0    57       90       86
1    60       68       79
2    37       89       65
3    36       93       43
4    35       38       80
5    54       99       37
6    88       36       26
7    83       90       68
8    51       65       45
9    90       72       94


---
**Example 1:** We have learnt to use the **df.loc[*row,col*]** property to extract a subset of data from a DataFrame. However, that requires knowing exactly which rows we need. In some situations, we want to extract data based on a ***condition***. For example, one may want to extract the records of all female students without knowing in advance which rows contain female students.

* **df.loc[*condition*]** will extract the rows in df DataFrame satisfying the condition in the square bracket.
* ***condition*** is a boolean expression involving columns (i.e. Series) in the DataFrame.
* When you are selecting rows by a ***condition***, you can either select the columns separately afterwards or pass them as the second argument, i.e. **df.loc[*condition*, *col*]**.
* If you need to combine multiple conditions, you need to use the bitwise logical operators below.
<p align = "left">
<table>
<tr><td>Operation</td><td>Bitwise Operator</td></tr>
<tr><td>AND</td><td>&</td></tr>
<tr><td>OR</td><td>|</td></tr>
<tr><td>NOT</td><td>~</td></tr>
</table>
</p>
* You can still use the usual comparison operators (**==**, **!=**, **<**, **<=**, **>**, **>=**).
* Beware that the bitwise operators have higher priority than the comparison operators, so wrap each comparison in parentheses before combining them.
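
The precedence rule above is easy to trip over, so here is a minimal sketch of what happens with and without the parentheses. It uses a small made-up DataFrame called `toy` (an assumption for illustration, not the lecture's student2.xlsx) so that it can run on its own.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate operator precedence
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "GPA": [3.5, 2.0, 1.5, 3.8],
})

# Correct: each comparison is wrapped in parentheses before using &
mask = (toy["Gender"] == "F") & (toy["GPA"] > 3)
print(toy.loc[mask])

# Without parentheses, & is evaluated before == and >,
# so Python first attempts "F" & toy["GPA"], which is not a valid operation.
try:
    toy.loc[toy["Gender"] == "F" & toy["GPA"] > 3]
    failed = False
except Exception as err:
    failed = True
    print("Error raised:", type(err).__name__)
```

Only the first student satisfies both conditions in the parenthesized version, while the version without parentheses stops with an error instead of returning rows.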

In [None]:
# Extracting all female students with GPA greater than 3 into datadf1
datadf1 = datadf.loc[(datadf["Gender"]=="F")&(datadf["GPA"]>3)]
display(datadf1)

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
8,9,Clara,Yau,F,51,65,45,Football/Music/Art,3.02,True,True


In [None]:
# Extracting students with math score greater than 80 into datadf2
datadf2 = datadf.loc[datadf["Math"]>80]
display(datadf2)

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False


In [None]:
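# Extracting students who are not female and have GPA greater than 1
# ~ negates the Gender condition before it is combined with &
data999=datadf.loc[~(datadf["Gender"]=="F") & (datadf["GPA"]>1)]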
display(data999)

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False
5,6,Jerry,Li,M,54,99,37,Ping Pong/Swim/Cycling,3.32,True,False
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False


In [None]:
# Extracting students who have both Math score and English score greater than 60.
# Important: We are using bitwise operator (&,|,~). Do not use the keywords (and,or,not).
datadf3 = datadf.loc[(datadf["Math"]>60) & (datadf["English"]>60)]
display(datadf3)

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False


In [None]:
# Extracting students who do not have a scholarship but have a loan
# Important: We are using bitwise operator (&,|,~). Do not use the keywords (and,or,not).
datadf4 = datadf.loc[~datadf["Scholarship"] & datadf["Loan"]]
display(datadf4)

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True


---
**Example 2**: When you are working on large DataFrames, you don't want to "display" the entire DataFrame. Instead, you can choose to display only the first few rows or the last few rows of the DataFrame.<br>
* Use **df.head(*n*)** to extract the first ***n*** rows from a DataFrame **df**. If ***n*** is not specified, the first 5 rows will be returned.
* Use **df.tail(*n*)** to extract the last ***n*** rows from a DataFrame **df**. If ***n*** is not specified, the last 5 rows will be returned.




In [None]:
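# n is not provided
# Therefore, we are getting the first 5 rows.
display(datadf.head())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False
3,4,Thomson,Ho,M,36,93,43,Reading/Dance,2.03,False,False
4,5,Mary,Cheng,F,35,38,80,Singing/Dance/Chess,1.78,False,True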


In [None]:
# n is 3
# Therefore, we are getting first 3 rows.
display(datadf.head(3))

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
0,1,Amy,Chan,F,57,90,86,Chess/Swim,1.79,False,False
1,2,Betty,Lee,F,60,68,79,Swim/Football,0.58,False,True
2,3,Johnny,Lam,M,37,89,65,Music/Dance/Swim,1.83,False,False


In [None]:
# n is not provided
# Therefore, we are getting the last 5 rows.
display(datadf.tail())

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
5,6,Jerry,Li,M,54,99,37,Ping Pong/Swim/Cycling,3.32,True,False
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False
8,9,Clara,Yau,F,51,65,45,Football/Music/Art,3.02,True,True
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False


In [None]:
# n is 4
# Therefore, we are getting the last 4 rows.
display(datadf.tail(4))

Unnamed: 0,StudentID,Firstname,Lastname,Gender,Math,English,Chinese,Hobby,GPA,Scholarship,Loan
6,7,Bob,Wong,M,88,36,26,Reading/Swim,2.81,False,True
7,8,Peter,Yeung,M,83,90,68,Gaming/Football,2.37,False,False
8,9,Clara,Yau,F,51,65,45,Football/Music/Art,3.02,True,True
9,10,Jacky,Lee,M,90,72,94,Art/Reading/Swim,3.89,True,False


---
**Example 3:** Pandas provides many statistical functions for summarizing the data in columns. Suppose **df** is a DataFrame. Here are some basic statistical functions that you can use on it.<br>
* **df.mean()** will calculate the means of all *numeric* columns in the DataFrame.
* **df.std()** will calculate the sample standard deviations of all *numeric* columns in the DataFrame.
* **df.var()** will calculate the sample variances of all *numeric* columns in the DataFrame.
* **df.median()** will calculate the medians of all *numeric* columns in the DataFrame.
* **df.quantile(q=*p*)** will calculate the requested quantile of all *numeric* columns in the DataFrame. The value p (0 to 1) specifies the quantile. E.g. p=0.25 means the 25th percentile.
* **df.min()** will calculate the minimums of all *numeric* columns in the DataFrame.
* **df.max()** will calculate the maximums of all *numeric* columns in the DataFrame.
* There are MANY more functions available. See the official manual at https://pandas.pydata.org/docs/reference/frame.html.
* If the DataFrame **df** contains non-numeric columns, older versions of pandas silently ignore them in the calculations. From pandas 2.0 onward, functions such as **df.mean()** raise an error on text columns unless you pass **numeric_only=True**. Either way, it is not good practice to rely on the default: select the numeric columns yourself or pass **numeric_only=True**.
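
How mean() treats text columns depends on the pandas version, because the default of `numeric_only` changed in pandas 2.0. The sketch below shows two ways to get the same result regardless of version. The DataFrame `toy` is made-up data for illustration, not the lecture file.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "Firstname": ["Amy", "Betty", "Johnny"],
    "Math": [57, 60, 37],
    "English": [90, 68, 89],
})

# Option 1: select the numeric columns explicitly
means_explicit = toy[["Math", "English"]].mean()

# Option 2: ask pandas to skip the non-numeric columns
means_numeric_only = toy.mean(numeric_only=True)

print(means_explicit)
print(means_numeric_only)
```

Both options return a Series indexed by `Math` and `English`. Option 1 makes the intent visible in the code, which is usually easier to read later.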





In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the means of all 3 columns
print(score.mean())

Math       59.1
English    74.0
Chinese    62.3
dtype: float64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the sample standard deviations of all 3 columns
print(score.std())

Math       21.241730
English    22.568415
Chinese    23.161510
dtype: float64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the sample variances of all 3 columns
print(score.var())

Math       451.211111
English    509.333333
Chinese    536.455556
dtype: float64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the minimums of all 3 columns
print(score.min())

Math       35
English    36
Chinese    26
dtype: int64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the maximums of all 3 columns
print(score.max())

Math       90
English    99
Chinese    94
dtype: int64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the medians of all 3 columns
print(score.median())

Math       55.5
English    80.5
Chinese    66.5
dtype: float64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the 25th percentiles of all 3 columns
print(score.quantile(q=0.25))

Math       40.50
English    65.75
Chinese    43.50
Name: 0.25, dtype: float64


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the 75th percentiles of all 3 columns
print(score.quantile(q=0.75))

Math       77.25
English    90.00
Chinese    79.75
Name: 0.75, dtype: float64


In [None]:
# Create a DataFrame containing some non-numeric columns
score = datadf[["Firstname","Lastname","Math","English","Chinese"]]

# Find the mean of numeric columns only
# numeric_only=True skips the text columns (needed in pandas 2.0 and later)
print(score.mean(numeric_only=True))

Math       59.1
English    74.0
Chinese    62.3
dtype: float64




---
**Example 4:** By default, the functions in Example 3 calculate the statistics by "columns". You can also calculate the statistics by "rows".<br>
* Suppose **df** is a DataFrame with some numeric columns.
* **df.mean(axis=0)** calculates the mean of each column. This is the default when **axis=** is not used.
* **df.mean(axis=1)** calculates the mean of each row.
* The use of **axis=0/1** is the same for all other functions.
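
To make the two directions concrete, here is a small sketch on made-up numbers that are easy to check by hand. The DataFrame `toy` and the column name `Average` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy scores for two students
toy = pd.DataFrame({"Math": [50, 70], "English": [90, 30]})

# axis=0: one mean per column (per subject)
col_means = toy.mean(axis=0)

# axis=1: one mean per row (per student)
row_means = toy.mean(axis=1)

# Row means line up with the original rows, so they can be attached as a new column
toy_with_avg = toy.assign(Average=row_means)

print(col_means)
print(toy_with_avg)
```

With `axis=0` the result is indexed by the column names, while with `axis=1` it is indexed by the row labels of the DataFrame.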


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Find the means of each column
# i.e. the mean scores of each subject
print(score.mean(axis=0))
display(score)

Math       59.1
English    74.0
Chinese    62.3
dtype: float64


Unnamed: 0,Math,English,Chinese
0,57,90,86
1,60,68,79
2,37,89,65
3,36,93,43
4,35,38,80
5,54,99,37
6,88,36,26
7,83,90,68
8,51,65,45
9,90,72,94


In [None]:
# Find the means of each row
# i.e. the average score of the 3 subjects
#      for each student
print(score.mean(axis=1))

0    77.666667
1    69.000000
2    63.666667
3    57.333333
4    51.000000
5    63.333333
6    50.000000
7    80.333333
8    53.666667
9    85.333333
dtype: float64


---
**Example 5:** When exploring your data, you want to get basic statistics of multiple variables quickly. 
* If **df** is a DataFrame, **df.describe()** will summarize all numeric columns in the DataFrame using several commonly used statistics.
* By default, it will calculate count, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile and maximum.
* Other percentiles can be requested by using the option **percentiles=*list_of_proportions***.


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Calculate summary statistics by using describe()
display(score.describe())


Unnamed: 0,Math,English,Chinese
count,10.0,10.0,10.0
mean,59.1,74.0,62.3
std,21.24173,22.568415,23.16151
min,35.0,36.0,26.0
25%,40.5,65.75,43.5
50%,55.5,80.5,66.5
75%,77.25,90.0,79.75
max,90.0,99.0,94.0


In [None]:
# Create a DataFrame containing the subject scores
score = datadf[["Math","English","Chinese"]]

# Calculate summary statistics by using describe() with custom percentiles
display(score.describe(percentiles=[0.1,0.2,0.5,0.8,0.9]))

Unnamed: 0,Math,English,Chinese
count,10.0,10.0,10.0
mean,59.1,74.0,62.3
std,21.24173,22.568415,23.16151
min,35.0,36.0,26.0
10%,35.9,37.8,35.9
20%,36.8,59.6,41.8
50%,55.5,80.5,66.5
80%,84.0,90.6,81.2
90%,88.2,93.6,86.8
max,90.0,99.0,94.0


---
**Example 6:** If you are summarizing categorical columns like Gender, Scholarship and Loan, numerical statistics like mean, standard deviation are not suitable.<br>

Suppose **s** is a Series containing a categorical variable such as Gender.
* **s.value_counts()** will count the frequency of each category in the Series.
* **s.mode()** will find the category with the highest frequency in the Series.

Suppose **df** is a DataFrame containing multiple categorical variables.
* **df.value_counts()** will count the frequency of each combination of categories across the columns.
* **df.mode()** will find the most frequent category of each column separately. This is not always the most frequent combination; the most frequent combination is the first entry of **df.value_counts()**.
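
Note that DataFrame.mode() works column by column, so its result can differ from the most frequent combination reported by value_counts(). The sketch below uses made-up data (the DataFrame `toy` is an assumption for illustration) where the two answers disagree.

```python
import pandas as pd

# Hypothetical toy data: 3 x (F, True), 2 x (F, False), 2 x (M, False)
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "F", "M", "M"],
    "Loan": [True, True, True, False, False, False, False],
})

# Frequency of each Gender/Loan combination, most frequent first
combo_counts = toy.value_counts()
most_common_combo = combo_counts.idxmax()

# Mode of each column, computed independently of the other column
column_modes = toy.mode().iloc[0]

print(combo_counts)
print("Most common combination:", most_common_combo)
print("Column-wise modes:", column_modes["Gender"], column_modes["Loan"])
```

Here the most common combination is female with a loan, while the column-wise modes are `F` for Gender and `False` for Loan, a combination that occurs only twice.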


In [None]:
# Calculate the frequency of each gender. 
# Note that datadf["Gender"] is a Series
print(datadf["Gender"].value_counts())
print(datadf["Gender"].mode())

M    6
F    4
Name: Gender, dtype: int64
0    M
dtype: object


In [None]:
# Find no. of True & False in Loan column. 
# Note that datadf["Loan"] is a Series
print(datadf["Loan"].value_counts())
print(datadf["Loan"].mode())

False    6
True     4
Name: Loan, dtype: int64
0    False
dtype: bool


In [None]:
# df_gl is a DataFrame containing 2 columns
columns = ["Gender","Loan"]
df_gl = datadf[columns]

# Find the frequency of each Gender/Loan combination
# and the mode of each column (computed column by column)
print(df_gl.value_counts())
print(df_gl.mode())


Gender  Loan 
M       False    5
F       True     3
        False    1
M       True     1
dtype: int64
  Gender   Loan
0      M  False


---
**Example 7:** The **describe()** method we talked about earlier can also be used for categorical columns. However, a different set of statistics will be produced, as listed below.

* **count** is the total number of non-missing data values in the column.
* **unique** is the number of distinct data values in the column.
* **top** is the data value that has the highest frequency in the column.
* **freq** is the frequency of the most common data value in the column.

In [None]:
# Create a DataFrame containing categorical columns
catdf = datadf[["Gender","Scholarship","Loan"]]

# Produce the summary statistics
# The result has rows count/unique/top/freq and one column per variable
display(catdf.describe())

Unnamed: 0,Gender,Scholarship,Loan
count,10,10,10
unique,2,2,2
top,M,False,False
freq,6,7,6


---
**Example 8:** If you want to use the summary statistics produced by the various functions, you need to pay attention to the data type (or structure) of the result.
* **df[*column_name*].mean()** and similar functions will produce a **numeric value**.
* **df[*list_of_column_names*].mean()** will produce a **Series**.
* **df[*column_name*].describe()** will produce a **Series**.
* **df[*list_of_column_names*].describe()** will produce a **DataFrame**.
* **df[*column_name*].value_counts()** will produce a **Series**.
* **df[*list_of_column_names*].value_counts()** will produce a **Series**.
* **df[*column_name*].mode()** will produce a **Series**.
* **df[*list_of_column_names*].mode()** will produce a **DataFrame**.







In [None]:
# Using numeric functions on 1 column gives a float value
result = datadf["Math"].mean()
print(type(result))
print(result)

<class 'numpy.float64'>
59.1


In [None]:
# Using numeric functions on a list of columns gives a Series
result = datadf[["Math","English","Chinese"]].mean()
print(type(result))
print(result)

# You can access the result by label or by position
# Use .iloc for positions; plain result[1] is deprecated for label-indexed Series
print(result["English"])
print(result.iloc[1])

<class 'pandas.core.series.Series'>
Math       59.1
English    74.0
Chinese    62.3
dtype: float64
74.0
74.0


In [None]:
# Using describe() function on 1 column gives a Series
result = datadf["Math"].describe()
print(type(result))
print(result)

# You can access the result by label or by position
# Use .iloc for positions; plain result[1] is deprecated for label-indexed Series
print(result["mean"])
print(result.iloc[1])

<class 'pandas.core.series.Series'>
count    10.00000
mean     59.10000
std      21.24173
min      35.00000
25%      40.50000
50%      55.50000
75%      77.25000
max      90.00000
Name: Math, dtype: float64
59.1
59.1


In [None]:
# Using describe() function on a list of columns gives a DataFrame
result = datadf[["Math","English","Chinese"]].describe()
print(type(result))
print(result)

# You can access the result by using .loc[] of DataFrame
# The row index should be the name of the statistic
print(result.loc["count","Math"])
print(result.loc["max","English"])

<class 'pandas.core.frame.DataFrame'>
           Math    English   Chinese
count  10.00000  10.000000  10.00000
mean   59.10000  74.000000  62.30000
std    21.24173  22.568415  23.16151
min    35.00000  36.000000  26.00000
25%    40.50000  65.750000  43.50000
50%    55.50000  80.500000  66.50000
75%    77.25000  90.000000  79.75000
max    90.00000  99.000000  94.00000
10.0
99.0


In [None]:
# Using value_counts() on 1 column gives a Series
result = datadf["Gender"].value_counts()
print(type(result))
print(result)

# You can access it by position (with .iloc) or by label
print(result.iloc[1])
print(result["F"])

<class 'pandas.core.series.Series'>
M    6
F    4
Name: Gender, dtype: int64
4
4


In [None]:
# Using value_counts() on a list of columns gives a Series
result = datadf[["Gender","Loan"]].value_counts()
print(type(result))
print(result)

# You can access it by position (with .iloc) or by multi-index label
print(result.iloc[0])
print(result[("M",False)])

<class 'pandas.core.series.Series'>
Gender  Loan 
M       False    5
F       True     3
        False    1
M       True     1
dtype: int64
5
5


In [None]:
# Using mode() on 1 column gives a Series
# Note that you can have more than 1 mode for categorical column!
result = datadf["Gender"].mode()
print(type(result))
print(result)

# You can access it by row index
print(result[0])

<class 'pandas.core.series.Series'>
0    M
dtype: object
M


In [None]:
# Using mode() on a list of columns gives a DataFrame
# Note that you can have more than 1 mode for categorical columns!
result = datadf[["Gender","Loan"]].mode()
print(type(result))
print(result)

# You can access the mode by .loc[]
print(result.loc[0])

<class 'pandas.core.frame.DataFrame'>
  Gender   Loan
0      M  False
Gender        M
Loan      False
Name: 0, dtype: object
