# Introduction to Pandas

Pandas is a package built on top of NumPy that provides an efficient implementation of a **DataFrame**. 

DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.


## Learning objectives

1. Fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.
2. Indexing 
3. Selection
4. Converting data types
5. Inspection and exploring
6. Renaming, removing, and creating columns
7. Renaming and removing rows

and more


In [2]:
import numpy as np
import pandas as pd

## Pandas data structure: Series

A Pandas Series is a **one-dimensional array** of **indexed data**. 
- Can be created from a list, an array, or a dictionary
- Combines values with **explicitly defined** indices
- Behaves like a labeled vector

In [3]:
x = pd.Series([2.3, 5.4, 3, 9])
x

0    2.3
1    5.4
2    3.0
3    9.0
dtype: float64

In [4]:
x.values

array([2.3, 5.4, 3. , 9. ])

In [5]:
x.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
# index
x[0]

2.3

In [12]:
x[1:3]

1    5.4
2    3.0
dtype: float64

In [8]:
x.dtype

dtype('float64')

In [14]:
# explicitly defined index
x = pd.Series([2.3, 5.4, 3, 9], index=["a", "b", "c", "d"])
x

a    2.3
b    5.4
c    3.0
d    9.0
dtype: float64

In [16]:
x["b"]

5.4

In [15]:
x[1]  # treated as a position here; recent pandas warns about this, so prefer x.iloc[1]

5.4

### Series as a specialized dictionary

In [17]:
population_dict = {'California': 39538223, 
                   'Texas': 29145505,
                   'Florida': 21538187, 
                   'New York': 20201249,
                   'Pennsylvania': 13002700}
pop = pd.Series(population_dict)
pop

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
dtype: int64

In [18]:
pop["California"]

39538223

In [19]:
pop["California":"Florida"]

California    39538223
Texas         29145505
Florida       21538187
dtype: int64
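The dictionary analogy can be sketched a little further. The example below is illustrative only: `pop_small` is a shortened, hypothetical copy of the population Series above. It shows that membership tests and key assignment work as they do for a dict, while arithmetic applies to every value at once, which a plain dict does not support.

```python
import pandas as pd

# Shortened, hypothetical copy of the population Series above
pop_small = pd.Series({"California": 39538223, "Texas": 29145505})

print("Texas" in pop_small)      # True: membership checks the index labels
print(list(pop_small.keys()))    # ['California', 'Texas']

pop_small["Florida"] = 21538187  # assigning to a new label adds an entry
in_millions = pop_small / 1e6    # unlike a dict, arithmetic applies to every value

print(len(pop_small))            # 3
print(in_millions.round(1).tolist())
```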

In [20]:
x = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"])
x

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: object

In [21]:
x.dtype

dtype('O')

`'O'` stands for `object`, the general-purpose data type pandas uses here for strings.

Change the data type with `.astype()`.

For example, `x.astype(int)`, `x.astype(str)`, `x.astype(float)`, `x.astype("category")`
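A minimal sketch of these conversions, assuming a made-up Series of numeric strings (the weekday strings above cannot be converted to numbers, so `astype(int)` would raise an error on them):

```python
import pandas as pd

# Hypothetical data: numbers stored as strings
s = pd.Series(["1", "2", "3"])

as_int = s.astype(int)         # strings -> integers
as_float = s.astype(float)     # strings -> floats
as_cat = s.astype("category")  # strings -> categorical

print(as_int.sum())    # 6
print(as_float.dtype)  # float64
print(as_cat.dtype)    # category
```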

In [22]:
x = x.astype("category")
x

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, object): ['Fri', 'Mon', 'Thu', 'Tue', 'Wed']

When you convert a `Series` to a categorical type, it can have an order defined, which allows for comparisons between categories. The `ordered` attribute tells you whether the categories are treated as ordered or not.

In [25]:
x = x.cat.reorder_categories(['Mon', 'Tue', 'Wed', 'Thu', 'Fri'], ordered=True)
x

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, object): ['Mon' < 'Tue' < 'Wed' < 'Thu' < 'Fri']

In [26]:
x.cat.ordered

True
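Once the categories are ordered, comparisons and sorting follow the category order rather than alphabetical order. A small sketch, rebuilding the weekday Series under the name `days` (a name chosen here for illustration):

```python
import pandas as pd

days = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri"]).astype("category")
days = days.cat.reorder_categories(["Mon", "Tue", "Wed", "Thu", "Fri"], ordered=True)

before_wed = days[days < "Wed"]  # comparison follows the category order, not the alphabet
print(list(before_wed))          # ['Mon', 'Tue']
print(days.max())                # Fri
print(list(days.sort_values(ascending=False)))  # ['Fri', 'Thu', 'Wed', 'Tue', 'Mon']
```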

## Pandas data structure: DataFrame

A DataFrame can be viewed as a **two-dimensional array** with **explicit row and column indices**. You can think of a DataFrame as a sequence of aligned Series objects that share the same row index.

`DataFrame` is like a matrix. Columns in a DataFrame are `Series`. 

- Each column is a variable. 
- Each row is an observation. 
- Each cell stores a value. 

In [27]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'Florida': 170312,
             'New York': 141297, 
             'Pennsylvania': 119280}
area = pd.Series(area_dict)
area

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
dtype: int64

In [28]:
data = pd.DataFrame({"population": pop, "area": area})
data

Unnamed: 0,population,area
California,39538223,423967
Texas,29145505,695662
Florida,21538187,170312
New York,20201249,141297
Pennsylvania,13002700,119280


In [29]:
# row Index
data.index

Index(['California', 'Texas', 'Florida', 'New York', 'Pennsylvania'], dtype='object')

In [30]:
# column Index
data.columns

Index(['population', 'area'], dtype='object')

In [31]:
data["population"]

California      39538223
Texas           29145505
Florida         21538187
New York        20201249
Pennsylvania    13002700
Name: population, dtype: int64

In addition to a dictionary of Series, a DataFrame can be created from:
- a list of dicts
- a 2D NumPy array

In [32]:
data = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
data

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [33]:
data = pd.DataFrame(np.random.random(10).reshape(5,2), columns=['feature1', 'feature2'])
data

Unnamed: 0,feature1,feature2
0,0.30689,0.836358
1,0.54506,0.331645
2,0.749247,0.293789
3,0.880794,0.283258
4,0.941039,0.018504


The most common way to create a DataFrame is to read it from a file.

In [34]:
df = pd.read_csv("iris.csv")
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


## Pandas data structure: Index

`Index` can be thought of either as an **immutable array** or as an **ordered set**. 

Row and column identifiers of a DataFrame are of `Index` type. 

In [38]:
ind = pd.Index([2,3,4,5,6,8,10])
ind

Index([2, 3, 4, 5, 6, 8, 10], dtype='int64')

In [36]:
ind[2:]

Index([4, 5, 6, 8, 10], dtype='int64')

In [37]:
ind.shape

(7,)

In [39]:
ind[0] = -2  # an Index is immutable, so this assignment raises an error

TypeError: Index does not support mutable operations

In [40]:
# set operations
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [41]:
indA.union(indB)

Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [42]:
indA.difference(indB)

Index([1, 9], dtype='int64')

In [43]:
indA.intersection(indB)

Index([3, 5, 7], dtype='int64')

In [44]:
x = pd.read_csv("iris.csv")
x

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [45]:
x.set_index("sepal_length")

Unnamed: 0_level_0,sepal_width,petal_length,petal_width,species
sepal_length,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
...,...,...,...,...
6.7,3.0,5.2,2.3,virginica
6.3,2.5,5.0,1.9,virginica
6.5,3.0,5.2,2.0,virginica
6.2,3.4,5.4,2.3,virginica


In [46]:
x.reset_index(drop=True)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
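Both `set_index` and `reset_index` return a new DataFrame and leave the original untouched unless the result is assigned back. A minimal sketch with a small, hypothetical table (`flowers` stands in for a file such as `iris.csv`):

```python
import pandas as pd

# Hypothetical table used only for illustration
flowers = pd.DataFrame({"name": ["rose", "tulip", "lily"], "petals": [5, 6, 6]})

by_name = flowers.set_index("name")       # new DataFrame; `flowers` is unchanged
restored = by_name.reset_index()          # index moves back into a regular column
dropped = by_name.reset_index(drop=True)  # index is discarded instead

print(list(flowers.columns))   # ['name', 'petals']
print(list(by_name.columns))   # ['petals']
print(list(restored.columns))  # ['name', 'petals']
print(list(dropped.columns))   # ['petals']
```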


## Indexing

In [47]:
data = pd.Series([0.1, 2.31, -1.2], index=[0, 1, 2])
data

0    0.10
1    2.31
2   -1.20
dtype: float64

In [48]:
data[1]

2.31

In [49]:
data.keys()

Index([0, 1, 2], dtype='int64')

In [50]:
list(data.items())

[(0, 0.1), (1, 2.31), (2, -1.2)]

In [51]:
data[7] = 3.141
data

0    0.100
1    2.310
2   -1.200
7    3.141
dtype: float64

In [52]:
# slicing
data[1:3]

1    2.31
2   -1.20
dtype: float64

In [53]:
# masking
data[(data>0) & (data<1)]

0    0.1
dtype: float64

In [54]:
data[[1,2]]

1    2.31
2   -1.20
dtype: float64

Note: If your Series has an explicit integer index, an indexing operation will use the explicit indices, while a slicing operation will use the implicit Python-style indices. 

In [55]:
data = pd.Series([0.1, 2.31, -1.2, 3.14], index=[1,3,5,7])

In [56]:
data[7]

3.14

In [58]:
data[2:4]

5   -1.20
7    3.14
dtype: float64

This mix of conventions is easy to get wrong. **Use `loc` and `iloc` to say which one you mean.**

`loc` allows indexing and slicing that always references the explicit index. 

`iloc` allows indexing and slicing that always references the implicit Python-style index. 

In [61]:
data.loc[3]

2.31

In [62]:
data.loc[7]

3.14

In [63]:
data.loc[2:6]

3    2.31
5   -1.20
dtype: float64

In [64]:
data.iloc[0]

0.1

In [65]:
data.iloc[3]

3.14

In [66]:
data.iloc[1:3]

3    2.31
5   -1.20
dtype: float64
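One detail worth noting in the outputs above: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as ordinary Python slices do. A short sketch on the same Series:

```python
import pandas as pd

data = pd.Series([0.1, 2.31, -1.2, 3.14], index=[1, 3, 5, 7])

by_label = data.loc[3:5]      # labels 3 through 5, end label included
by_position = data.iloc[1:2]  # position 1 only, end position excluded

print(list(by_label.index))     # [3, 5]
print(list(by_position.index))  # [3]
```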

## Selection

In [3]:
import pandas as pd
df = pd.read_csv("titanic.csv")
df.head(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
# select columns
df["age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [5]:
df.age  # attribute access fails if the column name is not a valid identifier or clashes with a DataFrame method or attribute

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [6]:
df.iloc[:,1]

0      3
1      1
2      3
3      1
4      3
      ..
886    2
887    1
888    3
889    1
890    3
Name: pclass, Length: 891, dtype: int64

In [9]:
df.iloc[:3, 1]

0    3
1    1
2    3
Name: pclass, dtype: int64

In [8]:
df.loc[df["age"] < 18]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
7,0,3,male,2.0,3,1,21.0750,S,Third,child,False,,Southampton,no,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
14,0,3,female,14.0,0,0,7.8542,S,Third,child,False,,Southampton,no,True
16,0,3,male,2.0,4,1,29.1250,Q,Third,child,False,,Queenstown,no,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
850,0,3,male,4.0,4,2,31.2750,S,Third,child,False,,Southampton,no,False
852,0,3,female,9.0,1,1,15.2458,C,Third,child,False,,Cherbourg,no,False
853,1,1,female,16.0,0,1,39.4000,S,First,woman,False,D,Southampton,yes,False
869,1,3,male,4.0,1,1,11.1333,S,Third,child,False,,Southampton,yes,False


In [10]:
df.loc[df["age"] < 18, ["alive", "sex", "age"]]

Unnamed: 0,alive,sex,age
7,no,male,2.0
9,yes,female,14.0
10,yes,female,4.0
14,no,female,14.0
16,no,male,2.0
...,...,...,...
850,no,male,4.0
852,no,female,9.0
853,yes,female,16.0
869,yes,male,4.0


In [12]:
# assign by position: row 0, column 1 (pclass)
df.iloc[0, 1] = 0

In [13]:
df.head(1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
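Boolean conditions can also be combined inside `loc`. The sketch below uses a small, hypothetical table (`people`, with a few of the Titanic column names) so that it runs without the CSV file. Each condition needs its own parentheses, and `&`, `|`, `~` replace `and`, `or`, `not`.

```python
import pandas as pd

# Hypothetical passenger table used only for illustration
people = pd.DataFrame({
    "age": [2.0, 38.0, 14.0, None, 35.0],
    "sex": ["male", "female", "female", "male", "male"],
    "pclass": [3, 1, 2, 3, 1],
})

# Rows where both conditions hold; a missing age never satisfies age < 18
young_women = people.loc[(people["age"] < 18) & (people["sex"] == "female")]

# Rows whose pclass is 1 or 2, restricted to two columns
first_or_second = people.loc[people["pclass"].isin([1, 2]), ["sex", "pclass"]]

print(list(young_women.index))      # [2]
print(list(first_or_second.index))  # [1, 2, 4]
```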


## Converting data types

In [14]:
# understand data types
df.dtypes

survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object

In [15]:
df["pclass"].unique()

array([0, 1, 3, 2], dtype=int64)

In [16]:
# Convert pclass from int64 to category.
df["pclass"] = df["pclass"].astype("category")
df["pclass"].dtype

CategoricalDtype(categories=[0, 1, 2, 3], ordered=False, categories_dtype=int64)
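`astype` raises an error when a value cannot be converted. One possible alternative for messy numeric columns is `pd.to_numeric` with `errors="coerce"`, sketched below on a made-up Series (`raw` is hypothetical and not part of the Titanic data):

```python
import pandas as pd

# Hypothetical column with one value that is not a number
raw = pd.Series(["3", "1", "unknown", "2"])

# errors="coerce" turns values that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.isna().sum())  # 1
print(numeric.sum())         # 6.0
```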

## Inspection and exploring

In [17]:
df.shape

(891, 15)

In [18]:
df.head(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [19]:
df.tail(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [20]:
df.sample(n=5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
489,1,3,male,9.0,1,1,15.9,S,Third,child,False,,Southampton,yes,False
252,0,1,male,62.0,0,0,26.55,S,First,man,True,C,Southampton,no,True
461,0,3,male,34.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
323,1,2,female,22.0,1,1,29.0,S,Second,woman,False,,Southampton,yes,False
577,1,1,female,39.0,1,0,55.9,S,First,woman,False,E,Southampton,yes,False


In [21]:
df.sample(frac=0.01)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
170,0,1,male,61.0,0,0,33.5,S,First,man,True,B,Southampton,no,True
284,0,1,male,,0,0,26.0,S,First,man,True,A,Southampton,no,True
431,1,3,female,,1,0,16.1,S,Third,woman,False,,Southampton,yes,False
69,0,3,male,26.0,2,0,8.6625,S,Third,man,True,,Southampton,no,False
736,0,3,female,48.0,1,3,34.375,S,Third,woman,False,,Southampton,no,False
230,1,1,female,35.0,1,0,83.475,S,First,woman,False,C,Southampton,yes,False
506,1,2,female,33.0,0,2,26.0,S,Second,woman,False,,Southampton,yes,False
834,0,3,male,18.0,0,0,8.3,S,Third,man,True,,Southampton,no,True
84,1,2,female,17.0,0,0,10.5,S,Second,woman,False,,Southampton,yes,True


In [22]:
df.describe()

Unnamed: 0,survived,age,sibsp,parch,fare
count,891.0,714.0,891.0,891.0,891.0
mean,0.383838,29.699118,0.523008,0.381594,32.204208
std,0.486592,14.526497,1.102743,0.806057,49.693429
min,0.0,0.42,0.0,0.0,0.0
25%,0.0,20.125,0.0,0.0,7.9104
50%,0.0,28.0,0.0,0.0,14.4542
75%,1.0,38.0,1.0,0.0,31.0
max,1.0,80.0,8.0,6.0,512.3292
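`describe()` summarizes numeric columns only by default. Two further checks that are often useful at this stage are value frequencies and missing-value counts. A sketch on a small, hypothetical table (`people`):

```python
import pandas as pd

# Hypothetical table used only for illustration
people = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "age": [22.0, 38.0, None, 35.0, None],
})

counts = people["sex"].value_counts()  # frequency of each distinct value
missing = people.isna().sum()          # number of missing values per column

print(counts["male"])   # 3
print(missing["age"])   # 2
```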


## Renaming columns

In [24]:
orig_colnames = df.columns
orig_colnames

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [26]:
df.columns = list("abcdefghijklmno")
df

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
0,0,0,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [27]:
df.columns = orig_colnames
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


## Removing columns

In [42]:
df.drop("survived", axis=1)

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [31]:
df.drop(columns=["pclass","survived", "sex", "age"])

Unnamed: 0,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [41]:
df.rename(columns={"age": "Age"}, inplace=True)
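Without `inplace=True`, both `rename` and `drop` return a new DataFrame and leave the original as it was, which is why the `drop` calls above did not change `df`. A minimal sketch on a hypothetical two-column table (`df_demo`):

```python
import pandas as pd

# Hypothetical table used only for illustration
df_demo = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28]})

renamed = df_demo.rename(columns={"age": "Age"})  # returns a new DataFrame
trimmed = df_demo.drop(columns=["fare"])          # also returns a new DataFrame

print(list(df_demo.columns))  # ['age', 'fare']  (original untouched)
print(list(renamed.columns))  # ['Age', 'fare']
print(list(trimmed.columns))  # ['age']
```

Assigning the result back (`df_demo = df_demo.rename(...)`) has the same effect as `inplace=True` and is often easier to follow.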

## Transforming and creating columns

In [44]:
df["Fare + Age"] = df["fare"] + df["Age"]
df

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age
0,0,0,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,29.25
1,1,1,female,38.0,1,0,71.28,C,First,woman,False,C,Cherbourg,yes,False,109.28
2,1,3,female,26.0,0,0,7.92,S,Third,woman,False,,Southampton,yes,True,33.92
3,1,1,female,35.0,1,0,53.10,S,First,woman,False,C,Southampton,yes,False,88.10
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,43.05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.00,S,Second,man,True,,Southampton,no,True,40.00
887,1,1,female,19.0,0,0,30.00,S,First,woman,False,B,Southampton,yes,True,49.00
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False,
889,1,1,male,26.0,0,0,30.00,C,First,man,True,C,Cherbourg,yes,True,56.00


In [47]:
import numpy as np
df["fare"] = np.round(df["fare"],0)
df

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age
0,0,0,male,22.0,1,0,7.0,S,Third,man,True,,Southampton,no,False,29.25
1,1,1,female,38.0,1,0,71.0,C,First,woman,False,C,Cherbourg,yes,False,109.28
2,1,3,female,26.0,0,0,8.0,S,Third,woman,False,,Southampton,yes,True,33.92
3,1,1,female,35.0,1,0,53.0,S,First,woman,False,C,Southampton,yes,False,88.10
4,0,3,male,35.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,43.05
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True,40.00
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True,49.00
888,0,3,female,,1,2,23.0,S,Third,woman,False,,Southampton,no,False,
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True,56.00


## In-class activity

In [50]:
# True where the fare exceeds the mean fare by more than 10
df["above_average"] = df["fare"] > df["fare"].mean() + 10
df

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
0,0,0,male,22.0,1,0,7.0,S,Third,man,True,,Southampton,no,False,29.25,False
1,1,1,female,38.0,1,0,71.0,C,First,woman,False,C,Cherbourg,yes,False,109.28,True
2,1,3,female,26.0,0,0,8.0,S,Third,woman,False,,Southampton,yes,True,33.92,False
3,1,1,female,35.0,1,0,53.0,S,First,woman,False,C,Southampton,yes,False,88.10,True
4,0,3,male,35.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,43.05,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True,40.00,False
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True,49.00,False
888,0,3,female,,1,2,23.0,S,Third,woman,False,,Southampton,no,False,,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True,56.00,False


### Renaming rows

In [36]:
df_sub = df.sample(n=3, random_state=42)
df_sub

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
709,1,3,male,,1,1,15.0,C,Third,man,True,,Cherbourg,yes,False,,False
439,0,2,male,31.0,0,0,10.0,S,Second,man,True,,Southampton,no,True,41.5,False
840,0,3,male,20.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,27.92,False


In [52]:
df_sub.rename({709:"a", 439:"b", 840:"c"})

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
a,1,3,male,,1,1,15.0,C,Third,man,True,,Cherbourg,yes,False,,False
b,0,2,male,31.0,0,0,10.0,S,Second,man,True,,Southampton,no,True,41.5,False
c,0,3,male,20.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,27.92,False


In [53]:
df_sub.index = ["hello", "world", "!"]
df_sub

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
hello,1,3,male,,1,1,15.0,C,Third,man,True,,Cherbourg,yes,False,,False
world,0,2,male,31.0,0,0,10.0,S,Second,man,True,,Southampton,no,True,41.5,False
!,0,3,male,20.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,27.92,False


In [56]:
df_sub.reset_index(drop=True)

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
0,1,3,male,,1,1,15.0,C,Third,man,True,,Cherbourg,yes,False,,False
1,0,2,male,31.0,0,0,10.0,S,Second,man,True,,Southampton,no,True,41.5,False
2,0,3,male,20.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,27.92,False


### Removing rows

In [59]:
df_sub = df.sample(n=10, random_state=42)
df_sub

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
709,1,3,male,,1,1,15.0,C,Third,man,True,,Cherbourg,yes,False,,False
439,0,2,male,31.0,0,0,10.0,S,Second,man,True,,Southampton,no,True,41.5,False
840,0,3,male,20.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,27.92,False
720,1,2,female,6.0,0,1,33.0,S,Second,child,False,,Southampton,yes,False,39.0,False
39,1,3,female,14.0,1,0,11.0,C,Third,child,False,,Cherbourg,yes,False,25.24,False
290,1,1,female,26.0,0,0,79.0,S,First,woman,False,,Southampton,yes,True,104.85,True
300,1,3,female,,0,0,8.0,Q,Third,woman,False,,Queenstown,yes,True,,False
333,0,3,male,16.0,2,0,18.0,S,Third,man,True,,Southampton,no,False,34.0,False
208,1,3,female,16.0,0,0,8.0,Q,Third,woman,False,,Queenstown,yes,True,23.75,False
136,1,1,female,19.0,0,2,26.0,S,First,woman,False,D,Southampton,yes,False,45.28,False


In [60]:
df_sub.drop([709,439], axis=0)

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
840,0,3,male,20.0,0,0,8.0,S,Third,man,True,,Southampton,no,True,27.92,False
720,1,2,female,6.0,0,1,33.0,S,Second,child,False,,Southampton,yes,False,39.0,False
39,1,3,female,14.0,1,0,11.0,C,Third,child,False,,Cherbourg,yes,False,25.24,False
290,1,1,female,26.0,0,0,79.0,S,First,woman,False,,Southampton,yes,True,104.85,True
300,1,3,female,,0,0,8.0,Q,Third,woman,False,,Queenstown,yes,True,,False
333,0,3,male,16.0,2,0,18.0,S,Third,man,True,,Southampton,no,False,34.0,False
208,1,3,female,16.0,0,0,8.0,Q,Third,woman,False,,Queenstown,yes,True,23.75,False
136,1,1,female,19.0,0,2,26.0,S,First,woman,False,D,Southampton,yes,False,45.28,False


In [62]:
idx = df_sub.loc[df_sub["alone"] == True].index
idx

Index([439, 840, 290, 300, 208], dtype='int64')

In [None]:
df_sub.drop(idx, axis=0)
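Dropping by a list of labels and filtering with a negated Boolean mask give the same rows. A minimal sketch, assuming a small hypothetical table (`sub`) that keeps only the `alone` column:

```python
import pandas as pd

# Hypothetical subset with row labels similar to the sample above
sub = pd.DataFrame({"alone": [False, True, True, False]}, index=[709, 439, 840, 720])

idx = sub.loc[sub["alone"]].index  # labels of the rows to remove
dropped = sub.drop(idx, axis=0)    # remove rows by label
filtered = sub.loc[~sub["alone"]]  # same rows, selected with a negated mask

print(list(dropped.index))       # [709, 720]
print(dropped.equals(filtered))  # True
```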

In [72]:
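df_sub.query("Age <= 30 and sex == 'female'")

Unnamed: 0,survived,pclass,sex,Age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,Fare + Age,above_average
720,1,2,female,6.0,0,1,33.0,S,Second,child,False,,Southampton,yes,False,39.0,False
39,1,3,female,14.0,1,0,11.0,C,Third,child,False,,Cherbourg,yes,False,25.24,False
290,1,1,female,26.0,0,0,79.0,S,First,woman,False,,Southampton,yes,True,104.85,True
208,1,3,female,16.0,0,0,8.0,Q,Third,woman,False,,Queenstown,yes,True,23.75,False
136,1,1,female,19.0,0,2,26.0,S,First,woman,False,D,Southampton,yes,False,45.28,False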

## Creating a new column for the square root of fare

In [1]:
# one reading of the intent: square root of the fare for first-class passengers, NaN otherwise
df["sqrt_of_fare"] = df.apply(lambda row: np.sqrt(row["fare"]) if row["class"] == "First" else np.nan, axis=1)
df
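A conditional column like this can also be built without a row-wise `apply`. The sketch below is one possible vectorized version, assuming the goal is the square root of the fare for first-class passengers and a missing value otherwise; `fares` is a hypothetical three-row table.

```python
import numpy as np
import pandas as pd

# Hypothetical table used only for illustration
fares = pd.DataFrame({"fare": [7.25, 71.28, 53.1], "class": ["Third", "First", "First"]})

# Square root everywhere, then keep it only where the class is "First" (NaN elsewhere)
fares["sqrt_of_fare"] = np.sqrt(fares["fare"]).where(fares["class"] == "First")

print(fares["sqrt_of_fare"].isna().tolist())  # [True, False, False]
```

Vectorized expressions such as this one are usually faster than `apply(..., axis=1)` on large tables, since they avoid calling a Python function for every row.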

## Operating

In [5]:
A = pd.DataFrame(np.random.randint(0,10,15).reshape(5,3), columns=["f1", "f2", "f3"])
A

Unnamed: 0,f1,f2,f3
0,0,5,8
1,2,0,8
2,4,2,8
3,1,0,1
4,7,9,6


In [6]:
B = pd.DataFrame(np.random.randint(0,10,6).reshape(2,3), columns=["f1", "f2", "f4"])
B

Unnamed: 0,f1,f2,f4
0,8,1,0
1,6,4,9


In [7]:
A+B

Unnamed: 0,f1,f2,f3,f4
0,8.0,6.0,,
1,8.0,4.0,,
2,,,,
3,,,,
4,,,,


In [8]:
A - A.iloc[0]

Unnamed: 0,f1,f2,f3
0,0,0,0
1,2,-5,0
2,4,-3,0
3,1,-5,-7
4,7,4,-2
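The `NaN` values in `A+B` come from index alignment: pandas matches rows and columns by label, and any label present in only one operand yields a missing value. A minimal sketch with small fixed tables (not the random ones above) showing how `add` with `fill_value` changes that:

```python
import pandas as pd

A = pd.DataFrame({"f1": [0, 2, 4], "f2": [5, 0, 2]})
B = pd.DataFrame({"f1": [8, 6], "f2": [1, 4]})

plain = A + B                    # row 2 has no partner in B, so it becomes NaN
filled = A.add(B, fill_value=0)  # treat the missing partner as 0 instead

print(plain.loc[2].isna().all())  # True
print(filled.loc[0].tolist())     # [8.0, 6.0]
print(filled.loc[2].tolist())     # [4.0, 2.0]
```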


## Missing values

Missing values are quite common in real datasets. Pandas provides useful methods for detecting, removing, and replacing null values in Pandas data structures.

- `isnull`: Generates a Boolean mask indicating missing values
- `notnull`: Opposite of isnull
- `dropna`: Returns a filtered version of the data
- `fillna`: Returns a copy of the data with missing values filled or imputed

In [18]:
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [19]:
df.isnull()

Unnamed: 0,0,1,2
0,False,True,False
1,False,False,False
2,True,False,False


In [20]:
df.notnull()

Unnamed: 0,0,1,2
0,True,False,True
1,True,True,True
2,False,True,True


In [21]:
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [22]:
df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


In [23]:
df[3] = np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [15]:
df.dropna(axis=1, how="all")

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [24]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,


In [25]:
# fillna
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [26]:
# fillna with a single value
df.fillna(-100)

Unnamed: 0,0,1,2,3
0,1.0,-100.0,2,-100.0
1,2.0,3.0,5,-100.0
2,-100.0,4.0,6,-100.0


In [27]:
df.fillna(df.mean(axis=0))

Unnamed: 0,0,1,2,3
0,1.0,3.5,2,
1,2.0,3.0,5,
2,1.5,4.0,6,


In [28]:
# forward fill; fillna(method="ffill") is deprecated in recent pandas
df.ffill()

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,2.0,4.0,6,


In [29]:
# backward fill; fillna(method="bfill") is deprecated in recent pandas
df.bfill()

Unnamed: 0,0,1,2,3
0,1.0,3.0,2,
1,2.0,3.0,5,
2,,4.0,6,
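`fillna` also accepts a dictionary, so each column can get its own fill value. A short sketch on a hypothetical two-column table (`df_na`), filling one column with a constant and the other with its median:

```python
import numpy as np
import pandas as pd

# Hypothetical table used only for illustration
df_na = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# A dict maps each column name to its own fill value
filled = df_na.fillna({"a": 0.0, "b": df_na["b"].median()})

print(filled["a"].tolist())  # [1.0, 0.0, 3.0]
print(filled["b"].tolist())  # [5.5, 5.0, 6.0]
```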


## Sorting

In [31]:
iris = pd.read_csv("iris.csv")
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [33]:
sorted_iris = iris.sort_values(by='sepal_length', ascending=True)
sorted_iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
42,4.4,3.2,1.3,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
...,...,...,...,...,...
122,7.7,2.8,6.7,2.0,virginica
118,7.7,2.6,6.9,2.3,virginica
117,7.7,3.8,6.7,2.2,virginica
135,7.7,3.0,6.1,2.3,virginica


In [34]:
sorted_iris = iris.sort_values(by=['sepal_length', 'petal_length'], ascending=[True, False])
sorted_iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
13,4.3,3.0,1.1,0.1,setosa
8,4.4,2.9,1.4,0.2,setosa
38,4.4,3.0,1.3,0.2,setosa
42,4.4,3.2,1.3,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
...,...,...,...,...,...
118,7.7,2.6,6.9,2.3,virginica
117,7.7,3.8,6.7,2.2,virginica
122,7.7,2.8,6.7,2.0,virginica
135,7.7,3.0,6.1,2.3,virginica
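Note that `sort_values` keeps the original row labels, as the shuffled index in the outputs above shows. A minimal sketch on a hypothetical table (`scores`) of how to renumber the rows or get back to the original order:

```python
import pandas as pd

# Hypothetical table used only for illustration
scores = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 1, 9]})

ranked = scores.sort_values(by="score", ascending=False)  # keeps original row labels
renumbered = ranked.reset_index(drop=True)                # fresh 0..n-1 labels
original_order = ranked.sort_index()                      # back to the original order

print(renumbered["score"].tolist())     # [9, 9, 3, 1]
print(original_order["name"].tolist())  # ['a', 'b', 'c', 'd']
```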


## MultiIndex

The `MultiIndex` represents multiple levels of indexing.

In [36]:
index = [('California', 2010), ('California', 2020),
         ('New York', 2010), ('New York', 2020),
         ('Texas', 2010), ('Texas', 2020)]
populations = [37253956, 39538223, 19378102, 20201249, 25145561, 29145505]
index = pd.MultiIndex.from_tuples(index)
pop = pd.Series(populations, index=index)
pop

California  2010    37253956
            2020    39538223
New York    2010    19378102
            2020    20201249
Texas       2010    25145561
            2020    29145505
dtype: int64

In [37]:
pop["California"]

2010    37253956
2020    39538223
dtype: int64

In [38]:
pop[:,2020]

California    39538223
New York      20201249
Texas         29145505
dtype: int64
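A Series with a two-level index can be reshaped into a table with `unstack`, which moves the inner level into the columns. A sketch using four of the population figures above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2010), ("California", 2020), ("Texas", 2010), ("Texas", 2020)]
)
pop = pd.Series([37253956, 39538223, 25145561, 29145505], index=index)

wide = pop.unstack()  # inner level (year) becomes the columns

print(list(wide.columns))       # [2010, 2020]
print(list(wide.index))         # ['California', 'Texas']
print(wide.loc["Texas", 2020])  # 29145505
```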

In [39]:
df_pop = pd.DataFrame({'total': pop,
                       'under18': [9284094, 8898092, 4318033, 4181528, 6879014, 7432474]})
df_pop

Unnamed: 0,Unnamed: 1,total,under18
California,2010,37253956,9284094
California,2020,39538223,8898092
New York,2010,19378102,4318033
New York,2020,20201249,4181528
Texas,2010,25145561,6879014
Texas,2020,29145505,7432474


In [40]:
df_pop.index

MultiIndex([('California', 2010),
            ('California', 2020),
            (  'New York', 2010),
            (  'New York', 2020),
            (     'Texas', 2010),
            (     'Texas', 2020)],
           )

In [41]:
df_pop.index.names=["state", "year"]

In [42]:
df_pop

Unnamed: 0_level_0,Unnamed: 1_level_0,total,under18
state,year,Unnamed: 2_level_1,Unnamed: 3_level_1
California,2010,37253956,9284094
California,2020,39538223,8898092
New York,2010,19378102,4318033
New York,2020,20201249,4181528
Texas,2010,25145561,6879014
Texas,2020,29145505,7432474


In [43]:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
X = np.random.random(24).reshape(4,6)
df = pd.DataFrame(X, index=index, columns=columns)
df

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,0.008046,0.109864,0.067247,0.89716,0.464497,0.232662
2013,2,0.447188,0.960026,0.539938,0.984265,0.073466,0.693804
2014,1,0.827818,0.257014,0.394495,0.219834,0.764677,0.571397
2014,2,0.003773,0.132587,0.006599,0.599116,0.892633,0.158513


In [44]:
df["Bob"]

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,0.008046,0.109864
2013,2,0.447188,0.960026
2014,1,0.827818,0.257014
2014,2,0.003773,0.132587


In [45]:
df.loc[[2013]]

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,0.008046,0.109864,0.067247,0.89716,0.464497,0.232662
2013,2,0.447188,0.960026,0.539938,0.984265,0.073466,0.693804


In [46]:
df.loc[(2013, 1),:]

subject  type
Bob      HR      0.008046
         Temp    0.109864
Guido    HR      0.067247
         Temp    0.897160
Sue      HR      0.464497
         Temp    0.232662
Name: (2013, 1), dtype: float64

In [47]:
df.loc[:,("Bob", "HR")]

year  visit
2013  1        0.008046
      2        0.447188
2014  1        0.827818
      2        0.003773
Name: (Bob, HR), dtype: float64

In [48]:
idx = pd.IndexSlice
df.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,0.008046,0.067247,0.464497
2014,1,0.827818,0.394495,0.764677


## Grouping

In [51]:
df = pd.read_csv("titanic.csv")
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [65]:
df.groupby(['sex']).count()

Unnamed: 0_level_0,survived,pclass,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
female,314,314,261,314,314,314,312,314,314,314,97,312,314,314
male,577,577,453,577,577,577,577,577,577,577,106,577,577,577
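
Note that `count()` reports the number of non-missing values per column, which is why `age` and `deck` are smaller than the other columns above. The difference from `size()` can be seen in a small sketch; the `toy` frame below is hypothetical data, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing ages
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "age": [38.0, np.nan, 22.0, 35.0, np.nan],
})

non_missing = toy.groupby("sex")["age"].count()  # skips NaN values
group_rows = toy.groupby("sex").size()           # counts every row in the group
print(non_missing)
print(group_rows)
```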


## Combining datasets

### `concat` 

In [66]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [62]:
df1 = pd.DataFrame(np.random.random((5,2)), columns=["A", "B"])
df2 = pd.DataFrame(np.random.random((5,2)), columns=["A", "B"])

In [67]:
df1

Unnamed: 0,A,B
0,0.420882,0.904974
1,0.14672,0.824134
2,0.018354,0.130225
3,0.800233,0.794239
4,0.702571,0.2076


In [68]:
df2

Unnamed: 0,A,B
0,0.155264,0.599921
1,0.618,0.501486
2,0.263462,0.725118
3,0.353284,0.396447
4,0.732665,0.242917


In [69]:
pd.concat([df1, df2])

Unnamed: 0,A,B
0,0.420882,0.904974
1,0.14672,0.824134
2,0.018354,0.130225
3,0.800233,0.794239
4,0.702571,0.2076
0,0.155264,0.599921
1,0.618,0.501486
2,0.263462,0.725118
3,0.353284,0.396447
4,0.732665,0.242917


In [70]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,A,B,A,B
0,0.420882,0.904974,0.155264,0.599921
1,0.14672,0.824134,0.618,0.501486
2,0.018354,0.130225,0.263462,0.725118
3,0.800233,0.794239,0.353284,0.396447
4,0.702571,0.2076,0.732665,0.242917


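Concatenating along rows keeps each frame's original index, so the labels 0-4 appear twice in the row-wise result above. `verify_integrity=True` raises an error when the indices overlap, and `ignore_index=True` builds a fresh index instead.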

In [74]:
pd.concat([df1, df2], verify_integrity=True)

ValueError: Indexes have overlapping values: Index([0, 1, 2, 3, 4], dtype='int64')

In [73]:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,A,B
0,0.420882,0.904974
1,0.14672,0.824134
2,0.018354,0.130225
3,0.800233,0.794239
4,0.702571,0.2076
5,0.155264,0.599921
6,0.618,0.501486
7,0.263462,0.725118
8,0.353284,0.396447
9,0.732665,0.242917


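### `append`

`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions both calls below raise `AttributeError`. Use `pd.concat` instead.
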
In [75]:
df1.append(df2)

AttributeError: 'DataFrame' object has no attribute 'append'

In [76]:
df1.append(df2, ignore_index=True)

AttributeError: 'DataFrame' object has no attribute 'append'
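
On pandas versions where `append` is unavailable, as in the errors above, `pd.concat` covers the same two cases. A minimal sketch with hypothetical frames `df_a` and `df_b`:

```python
import pandas as pd

# Hypothetical frames standing in for df1 and df2
df_a = pd.DataFrame({"A": [1, 2]})
df_b = pd.DataFrame({"A": [3, 4]})

# Replacement for the old df_a.append(df_b): original labels are kept
appended = pd.concat([df_a, df_b])

# Replacement for the old df_a.append(df_b, ignore_index=True)
appended_reset = pd.concat([df_a, df_b], ignore_index=True)

print(appended.index.tolist())
print(appended_reset.index.tolist())
```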

# In-class activity: Randomly divide the class into 22 groups for the final project, with 4-5 students per group.

# Final project groups

In [78]:
students_df = pd.read_csv('students_list.csv')
students_df

Unnamed: 0,First Name,Last Name,Email
0,Abhimanyu,Agashe,manyu@unc.edu
1,Abhishri,Agrawal,abhishri@unc.edu
2,Adam,Zawati,zawati@unc.edu
3,Adil,Syed,adilsyed@unc.edu
4,Aditi,Patil,aditipat@unc.edu
...,...,...,...
96,Yizhe,Yang,yangy23@unc.edu
97,Zahra,Alqudaihi,zahraq58@ad.unc.edu
98,Zhaojiayi,Zhang,zzhang27@unc.edu
99,Zheyuan,Liu,zheyuan@ad.unc.edu


In [90]:
group_sizes = [4] * 9 + [5] * 13
group_sizes

[4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

In [91]:
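groupid = []

# Append one block of labels per group; the labels are floats because np.ones returns floats.
# The loop variable is named group_size so it does not overwrite the group_sizes list.
for id_group, group_size in enumerate(group_sizes):
    groupid = np.hstack([groupid, np.ones(group_size) * id_group])

groupid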

array([ 0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  2.,  2.,  2.,  2.,  3.,
        3.,  3.,  3.,  4.,  4.,  4.,  4.,  5.,  5.,  5.,  5.,  6.,  6.,
        6.,  6.,  7.,  7.,  7.,  7.,  8.,  8.,  8.,  8.,  9.,  9.,  9.,
        9.,  9., 10., 10., 10., 10., 10., 11., 11., 11., 11., 11., 12.,
       12., 12., 12., 12., 13., 13., 13., 13., 13., 14., 14., 14., 14.,
       14., 15., 15., 15., 15., 15., 16., 16., 16., 16., 16., 17., 17.,
       17., 17., 17., 18., 18., 18., 18., 18., 19., 19., 19., 19., 19.,
       20., 20., 20., 20., 20., 21., 21., 21., 21., 21.])

In [93]:
np.random.shuffle(groupid)
groupid

array([ 2.,  1.,  6.,  6.,  6., 18., 19.,  9.,  8., 18.,  7.,  6.,  3.,
       18.,  0., 21.,  7.,  1., 10., 15., 20., 19., 16., 20., 13., 12.,
       16., 15.,  2., 13., 17.,  7., 21., 15.,  9.,  0., 14., 16.,  9.,
        2.,  5., 21., 10., 12., 11.,  0., 15., 15., 14., 21.,  7., 11.,
        8., 19., 13., 12.,  1.,  3., 10., 17., 11., 18., 10., 13.,  8.,
       11.,  5., 20.,  0., 14., 17., 18., 20., 11.,  4., 16.,  4.,  3.,
       10.,  8.,  5.,  2.,  3.,  4., 19., 12., 13.,  1.,  9.,  4.,  5.,
        9., 17., 14., 12., 21., 20., 17., 14., 19., 16.])

For example, there are 9 groups of 4 students and 13 groups of 5 students.
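
The same labels can also be built without a loop. The sketch below is one possible alternative, not the approach used in the cells above: it uses `np.repeat` to get integer labels directly and a seeded generator so the shuffle is reproducible. The seed value is an arbitrary assumption.

```python
import numpy as np

group_sizes = [4] * 9 + [5] * 13

# Group k appears group_sizes[k] times; the labels stay integers
group_ids = np.repeat(np.arange(len(group_sizes)), group_sizes)

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility
shuffled_ids = rng.permutation(group_ids)

# Shuffling changes the order but not how many times each label occurs
print(len(shuffled_ids), np.bincount(shuffled_ids))
```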

In [95]:
random_students = students_df.sample(frac=1)
random_students["Group"] = groupid

In [96]:
random_students

Unnamed: 0,First Name,Last Name,Email,Group
1,Abhishri,Agrawal,abhishri@unc.edu,2.0
24,Erin,Morley,emorley@unc.edu,1.0
4,Aditi,Patil,aditipat@unc.edu,6.0
98,Zhaojiayi,Zhang,zzhang27@unc.edu,6.0
23,Emily,Nguyen,aemirri@unc.edu,6.0
...,...,...,...,...
8,Atticus,Bacon,atticusb@unc.edu,20.0
79,Shenghe,Ma,mshenghe@unc.edu,17.0
22,Edison,Guo,edguo@ad.unc.edu,14.0
71,Robert,Trenkamp,robert.trenkamp@unc.edu,19.0


In [100]:
# Assign the converted column back; the conversion alone returns a new Series
random_students['Group'] = random_students['Group'].astype(int)
random_students

Unnamed: 0,First Name,Last Name,Email,Group
1,Abhishri,Agrawal,abhishri@unc.edu,2
24,Erin,Morley,emorley@unc.edu,1
4,Aditi,Patil,aditipat@unc.edu,6
98,Zhaojiayi,Zhang,zzhang27@unc.edu,6
23,Emily,Nguyen,aemirri@unc.edu,6
...,...,...,...,...
8,Atticus,Bacon,atticusb@unc.edu,20
79,Shenghe,Ma,mshenghe@unc.edu,17
22,Edison,Guo,edguo@ad.unc.edu,14
71,Robert,Trenkamp,robert.trenkamp@unc.edu,19
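
Converting the float labels to integers only takes effect when the result is assigned back to the column, and `value_counts` gives a quick check of the group sizes afterwards. A minimal sketch, assuming a hypothetical six-student `roster` in place of the class list:

```python
import numpy as np
import pandas as pd

# Hypothetical roster and float labels for two groups of three
roster = pd.DataFrame({"First Name": ["Ana", "Ben", "Cy", "Di", "Eli", "Fay"]})
labels = np.array([0., 0., 0., 1., 1., 1.])

# Shuffle the rows, then attach the labels by position
shuffled = roster.sample(frac=1, random_state=0).copy()
shuffled["Group"] = labels

# astype returns a new Series, so assign it back to change the column
shuffled["Group"] = shuffled["Group"].astype(int)

sizes = shuffled["Group"].value_counts().sort_index()
print(shuffled["Group"].dtype)
print(sizes)
```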
