# Sampling Using Python

## Outline
- What is Sampling?
- Simple Random Sampling
- Sampling a Dataframe
- Wrap Up


## What is Sampling?
   
When you can't collect data from an entire population, sampling is performed to collect a representative "sample" of the whole population. 

Sampling can be probabilistic or non-probabilistic.

Probability sampling is a sampling technique in which the researcher chooses samples from a larger population using random selection.

Non-probability sampling is a sampling technique in which the researcher selects samples based on subjective judgment rather than random selection.



| Probability Sampling | Non-Probability Sampling |
| --- | --- |
| Samples are randomly selected. | Samples are selected on the basis of the <br> researcher’s subjective judgment. |
| Everyone in the population has a known chance of being selected <br> (an equal chance in simple random sampling). | Not everyone has an equal chance of <br> being selected. |
| Researchers use this technique when they <br> want to reduce sampling bias. | Sampling bias is not a concern for <br> the researcher. |
| Used when the researcher wants to create <br> accurate samples. | This method does not help <br> represent the population accurately. |
| Finding the correct audience is difficult. | Finding an audience is very simple. |

## Simple Random Sampling

Simple random sampling is the simplest probability sampling technique. As the name suggests, it is an entirely random method of selecting the sample.

This sampling method is as easy as assigning numbers to the population we want to sample and then randomly choosing from those numbers through an automated process.
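
The numbering idea above can be sketched in a few lines. This is a minimal illustration rather than a prescribed procedure: the names in `population` are invented, and the seed is only there so the run is repeatable.

```python
import random

# Hypothetical population of eight people (names invented for illustration)
population = ["Ana", "Ben", "Chloe", "Dev", "Elena", "Farid", "Grace", "Hiro"]

# Step 1: assign a number to every member of the population
numbered = dict(enumerate(population))

# Step 2: let the computer pick three of those numbers at random
random.seed(42)  # fixed seed so the example is repeatable
chosen_numbers = random.sample(range(len(population)), k=3)

# Step 3: look up who the chosen numbers belong to
chosen_people = [numbered[i] for i in chosen_numbers]
print(chosen_numbers, chosen_people)
```

Because the numbers are drawn by the random number generator, no human judgment enters the selection.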


### Understanding simple random sampling with an example

The population of the United States is about **330 million**. It is practically impossible to send a survey to every individual to gather information. Instead, we can use probability sampling to identify a sample of **1 million** people and collect data from them. Surveying this sample lets us question a set of citizens who are representative of the broader population and limits the potential for bias.
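
To get a feel for the scale of this example, the sketch below draws 1 million ID numbers out of 330 million. The ID numbering is an assumption made for illustration; a real survey would need an actual list of residents to sample from.

```python
import random

POPULATION_SIZE = 330_000_000  # approximate US population from the example
SAMPLE_SIZE = 1_000_000

random.seed(2024)  # arbitrary seed for repeatability

# range() is a lazy sequence, so the full population is never stored in memory
sampled_ids = random.sample(range(POPULATION_SIZE), k=SAMPLE_SIZE)

# Fraction of the population that ends up in the sample
sampling_fraction = SAMPLE_SIZE / POPULATION_SIZE
print(f"Sampled {len(sampled_ids):,} IDs ({sampling_fraction:.2%} of the population)")
```

The sample covers roughly 0.3% of the population, yet every ID had the same chance of being drawn.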


![](./images/sampling.png)

### Simple random sampling in Python

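In Python, simple random sampling can be done with the standard library function `random.sample(population, k)`. This function takes two required arguments:
- `population`: the sequence to sample from, such as a list, tuple, string, or `range`. Sets and dictionaries are not sequences, so convert them first, for example with `sorted(my_set)` or `list(my_dict)`.
- `k`: the number of items to select, which is also the length of the returned list. Items are chosen without replacement, so `k` cannot exceed the size of the population.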
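
The sketch below shows a few ways of passing a population to `random.sample()`. The fleet sizes in `fleet_sizes` are placeholder values invented for this example. Python 3.11 and later raise a `TypeError` when given a set or a dictionary, so the sketch converts them to a list first.

```python
import random

random.seed(7)  # arbitrary seed for repeatability

# A list or tuple can be passed directly
airlines = ["KLM", "Delta", "ANA", "United", "Emirates"]
from_list = random.sample(airlines, k=2)

# A range works too, without building the full list of numbers
from_range = random.sample(range(1, 101), k=5)

# A set has no order, so sort it into a list first
airline_set = {"KLM", "Delta", "ANA", "United", "Emirates"}
from_set = random.sample(sorted(airline_set), k=2)

# For a dictionary, sample from its keys (fleet sizes are made-up values)
fleet_sizes = {"KLM": 110, "Delta": 900, "ANA": 210}
from_dict = random.sample(list(fleet_sizes), k=2)

print(from_list, from_range, from_set, from_dict)
```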

Suppose you are a data scientist at a travel company and you have access to several databases that collect data from airlines. All of these databases contain the same features, but because of their size, you may only choose four for your analysis.




In [1]:
# Select four databases to analyze

# import library
import random

# define a list with the names of the airline databases
airlines_data = ["AirFrance", "KLM", "AirCanada", "BritishAirways", "Poland", "Delta", "Emirates", "United",
                 "ANA", "Lufthansa", "Thai"]
# set the seed for reproducibility
random.seed(123)
# sample four databases
print("Choosing four airlines databases:\n", random.sample(airlines_data, k=4))


Choosing four airlines databases:
 ['AirFrance', 'Poland', 'KLM', 'Emirates']


Notice that `random.sample()` returns a new list containing the requested number of items, chosen at random and without replacement from the sequence. The original sequence is left unchanged.
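
The following sketch checks three properties of `random.sample()` on a shortened version of the airline list: resetting the seed reproduces the same sample, the input list is not modified, and no item is picked twice. The seed value and the sample size of three are arbitrary choices.

```python
import random

airlines = ["AirFrance", "KLM", "AirCanada", "Delta", "Emirates", "United"]
original = list(airlines)  # copy to compare against later

random.seed(123)
first_draw = random.sample(airlines, k=3)

random.seed(123)  # resetting the same seed replays the same random choices
second_draw = random.sample(airlines, k=3)

print(first_draw == second_draw)  # True: same seed, same sample
print(airlines == original)       # True: the population list is untouched
print(len(set(first_draw)) == 3)  # True: no airline is picked twice
```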

Because every item is equally likely to be chosen, this sample is meant to be an unbiased representation of the total population.
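
A quick simulation can illustrate what "unbiased" means here. The sketch below, with an arbitrary five-item population and trial count, repeats the sampling many times and counts how often each item is selected. With `k` items drawn out of `n`, each item should appear in roughly `k / n` of the samples.

```python
import random
from collections import Counter

items = ["A", "B", "C", "D", "E"]
k = 2
n_trials = 20_000

random.seed(0)  # arbitrary seed for repeatability
counts = Counter()
for _ in range(n_trials):
    counts.update(random.sample(items, k))

# Each item should appear in about k / len(items) = 40% of the samples
for item in items:
    print(item, round(counts[item] / n_trials, 3))
```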

## Sampling a dataframe

Sampling is widely used in data science and machine learning, so it is often applied to dataframes.

The `pandas` function `DataFrame.sample()` is used to generate a random sample from the dataframe. You can find the documentation about this function [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).

Consider the following dataset that contains data of flights between Australian cities (you can find more information [here](https://www.kaggle.com/alphajuliet/au-dom-traffic)).

In [None]:
#import pandas package
import pandas as pd

#read data
flight_data = pd.read_csv("australia_flights.csv")

#display the first 5 rows
flight_data.head()

### Example 1: Generate 10 random rows from our dataframe.

In [None]:
#obtaining a random sample of 10 rows from flight_data
flight_10 = flight_data.sample(n=10)

#displaying the 10 rows we sampled
flight_10
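
Each call to `sample()` returns a different set of rows unless a seed is supplied through the `random_state` parameter. The sketch below uses a small stand-in dataframe with invented column names and values, since the structure of `australia_flights.csv` is not shown here.

```python
import pandas as pd

# Stand-in data: column names and values are invented for illustration
toy_flights = pd.DataFrame({
    "route": [f"R{i}" for i in range(20)],
    "passengers": list(range(100, 2100, 100)),
})

# The same random_state gives the same rows every time
sample_a = toy_flights.sample(n=5, random_state=1)
sample_b = toy_flights.sample(n=5, random_state=1)
print(sample_a.equals(sample_b))  # True

# Rows keep their original index labels, so you can trace them back
print(sorted(sample_a.index))
```

Fixing `random_state` plays the same role here as `random.seed()` did for `random.sample()`.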

### Example 2: Generate a sample containing 25% of the data in our dataframe.

In [None]:
#obtaining a random sample of 25% of flight_data
flights_quarter = flight_data.sample(frac=0.25)

#displaying the 25% of the dataframe we sampled
flights_quarter

In [None]:
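# checking that the sample holds 25% of the rows using len();
# the expected count is rounded because 0.25 * len(flight_data) may not be a whole number
expected_rows = round(0.25 * len(flight_data))
if len(flights_quarter) == expected_rows:
    print("Great!")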
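
The number of rows returned for `frac=0.25` is always a whole number, even when a quarter of the row count is not. The sketch below uses an invented 11-row dataframe and assumes that pandas rounds `frac * len(df)` to the nearest integer, which matches the behavior of current pandas versions as far as we know.

```python
import pandas as pd

# Invented 11-row dataframe: 0.25 * 11 = 2.75, which is not a whole number
toy = pd.DataFrame({"value": list(range(11))})

quarter = toy.sample(frac=0.25, random_state=0)

print(0.25 * len(toy))                         # 2.75
print(len(quarter))                            # 3
print(len(quarter) == round(0.25 * len(toy)))  # True
```

A direct comparison between `0.25 * len(toy)` and `len(quarter)` would therefore fail here, while the rounded comparison passes.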

## Wrap up
In this lesson, we saw that:
- Sampling is used to select a portion of the initial data
- Simple random sampling is the simplest probability sampling technique and aims to choose an unbiased subset of the original data
- Sampling can also be performed on a dataframe, using `DataFrame.sample()`, to select a portion of its rows
