In [2]:
import pandas as pd

### Pandas Sample
Pandas Sample is a great way to pull a random sample of rows from your DataFrame. I use it most often when I need to subset my data, but I want to do it randomly.

Examples we'll run through:
1. Simple sample setting 'n'
2. Simple sample setting 'frac'
3. Sample setting 'n' and replace
4. Sample with weights
5. Sample random columns

But first, let's start with a short list of restaurants and bars in San Francisco:

In [24]:
df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))
df

Unnamed: 0,name,type,AvgBill
0,Foreign Cinema,Restaurant,289.0
1,Liho Liho,Restaurant,224.0
2,500 Club,bar,80.5
3,The Square,bar,25.3
4,Page,bar,80.34
5,Tompkins,bar,34.2
6,Als Place,Restaurant,56.52


### 1. Simple sample setting 'n'
Specifying 'n' is specifying the number of random rows you want to return.

Notice how I specify n=2 and I get two random rows back.

In [25]:
df.sample(n=2)

Unnamed: 0,name,type,AvgBill
3,The Square,bar,25.3
0,Foreign Cinema,Restaurant,289.0


If I do it again, I get another set of random rows.

In [26]:
df.sample(n=2)

Unnamed: 0,name,type,AvgBill
6,Als Place,Restaurant,56.52
2,500 Club,bar,80.5
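
If you need the same rows back on every run, pandas lets you pass a random_state. Here is a minimal sketch of that idea; the seed value 0 and the names `first` and `second` are just my choices for illustration.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# Two calls with the same seed (0 is an arbitrary choice) return the same rows
first = df.sample(n=2, random_state=0)
second = df.sample(n=2, random_state=0)

print(first.equals(second))  # True
```

Leave random_state out when you actually want a fresh draw each time.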


### 2. Simple sample setting 'frac'
Instead of setting 'n', you could specify 'frac', which tells pandas what fraction of your DataFrame you want randomly returned.

Here I'm setting frac=.4, or 40%. Since I have 7 rows, 40% is 2.8 rows, which pandas rounds to the nearest whole number: 3.

In [27]:
df.sample(frac=.4)

Unnamed: 0,name,type,AvgBill
3,The Square,bar,25.3
5,Tompkins,bar,34.2
4,Page,bar,80.34
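
To make the arithmetic concrete, here is a small sketch that checks the row count frac=.4 produces on this 7-row DataFrame. It also shows frac=1, which returns every row in random order (a handy way to shuffle). The variable names are my own.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# 0.4 * 7 = 2.8, which rounds to 3 rows
n_rows = len(df.sample(frac=0.4))
print(n_rows)  # 3

# frac=1 keeps all 7 rows but returns them in random order
shuffled = df.sample(frac=1)
print(len(shuffled))  # 7
```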


### 3. Sample setting 'n' and replace
By default, pandas will only select a random row once. However, if you wanted to be able to select the same row more than once, then you can set replace=True. This will 'replace' your rows back into the DataFrame for sampling again.

In this case, you can set n greater than the number of rows in your DataFrame.

Notice the same row below is randomly picked twice now.

In [28]:
df.sample(n=5, replace=True)

Unnamed: 0,name,type,AvgBill
3,The Square,bar,25.3
2,500 Club,bar,80.5
6,Als Place,Restaurant,56.52
5,Tompkins,bar,34.2
5,Tompkins,bar,34.2
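
Here is a quick sketch of what happens when n is larger than the DataFrame. Without replace=True pandas raises a ValueError; with it, the call succeeds and some rows have to repeat. The n=10 value and the name `oversampled` are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# Asking for 10 rows from 7 without replacement is not possible
try:
    df.sample(n=10)
except ValueError as err:
    print('ValueError:', err)

# With replacement, 10 draws from 7 rows works, and duplicates are guaranteed
oversampled = df.sample(n=10, replace=True)
print(len(oversampled))                        # 10
print(oversampled.index.duplicated().any())    # True
```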


### 4. Sample with weights
By default, pandas gives each row an equal chance to be selected. However, what if you wanted to select restaurants more often than bars? You could give restaurants a higher chance (higher weights) to be picked.

First, let me add weights to my DataFrame. I want each restaurant to be 5x as likely to be picked as each bar, so I'll give each restaurant weights=5 and each bar weights=1.

In [40]:
weights = {'Restaurant': 5,
          'bar': 1}
df['weights'] = df['type'].map(weights)
df

Unnamed: 0,name,type,AvgBill,weights
0,Foreign Cinema,Restaurant,289.0,5
1,Liho Liho,Restaurant,224.0,5
2,500 Club,bar,80.5,1
3,The Square,bar,25.3,1
4,Page,bar,80.34,1
5,Tompkins,bar,34.2,1
6,Als Place,Restaurant,56.52,5


Here I'll pull a random sample of 3 rows from my DataFrame and pass my weights column. I set random_state to make sure I get the same random rows each time. Notice how 2 of the 3 rows are restaurants. That is because they had higher weights and therefore a bigger chance to be picked.

In [43]:
df.sample(n=3, weights='weights', random_state=42)

Unnamed: 0,name,type,AvgBill,weights
1,Liho Liho,Restaurant,224.0,5
6,Als Place,Restaurant,56.52,5
5,Tompkins,bar,34.2,1
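
Weights don't need to sum to 1; pandas normalizes them for you. As a rough sketch of what that means here, the code below computes each row's chance on a single draw (weight divided by the sum of all weights, which is 19) and then checks it empirically with a large sample drawn with replacement. The `pick_chance` column, the `weight_by_type` name, the 10,000 draws, and the seed are all my own choices for illustration.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

weight_by_type = {'Restaurant': 5, 'bar': 1}
df['weights'] = df['type'].map(weight_by_type)

# Chance of each row on a single draw: its weight over the total (3*5 + 4*1 = 19)
df['pick_chance'] = df['weights'] / df['weights'].sum()
print(df[['name', 'weights', 'pick_chance']])

# Empirical check: draw many rows with replacement and measure the restaurant share.
# Three restaurants at 5/19 each should land near 15/19 (about 0.79).
draws = df.sample(n=10_000, weights='weights', replace=True, random_state=0)
restaurant_share = (draws['type'] == 'Restaurant').mean()
print(round(restaurant_share, 2))
```

Keep in mind that without replacement the chances shift after each pick, so these single-draw numbers only describe the first row selected.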


### 5. Sample random columns
Say you wanted to randomly select columns instead of rows. Just set axis=1.

In [45]:
df.sample(n=2, axis=1)

Unnamed: 0,type,AvgBill
0,Restaurant,289.0
1,Restaurant,224.0
2,bar,80.5
3,bar,25.3
4,bar,80.34
5,bar,34.2
6,Restaurant,56.52


Remember, you'll get random items each time you run your code unless you set a random_state.

In [46]:
df.sample(n=2, axis=1)

Unnamed: 0,AvgBill,weights
0,289.0,5
1,224.0,5
2,80.5,1
3,25.3,1
4,80.34,1
5,34.2,1
6,56.52,5
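
Column sampling takes random_state too. This is a small sketch showing that two calls with the same seed pick the same columns, and that axis='columns' is an equivalent spelling of axis=1. The seed value and variable names are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# Same seed, same two columns (1 is an arbitrary seed)
cols_a = df.sample(n=2, axis=1, random_state=1)
cols_b = df.sample(n=2, axis='columns', random_state=1)

print(cols_a.shape)            # (7, 2)
print(cols_a.equals(cols_b))   # True
```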
