# Sampling Methods - Titanic Dataset

### Objective
The purpose of this exercise is to apply several sampling methods to the Titanic dataset and to assess how representative the resulting samples are with respect to the variables under investigation.

## 1. Characteristics of selected sampling methods

### 1. Simple Random Sampling
Simple random sampling involves selecting individuals randomly from the population, where each individual has an equal chance of being selected. Use simple random sampling when the population is homogeneous and you want each member to have an equal chance of being included. It's straightforward and suitable when there are no known differences or groupings within the population.

### 2. Stratified Random Sampling
Stratified random sampling divides the population into homogeneous subgroups (strata) based on certain characteristics (e.g., age groups, socio-economic status) and then samples randomly from each subgroup. With proportional allocation, each subgroup appears in the sample in proportion to its share of the population; with equal allocation, every stratum contributes the same number of units regardless of its size. Use stratified random sampling when the population can be divided into distinct groups with different characteristics and you want every group represented in your sample. It helps reduce variability and increase precision for subgroup analyses.
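As a rough illustration of proportional allocation, the sketch below uses only the standard library and a hypothetical toy population (640 "male" and 360 "female" records). The function name `proportional_stratified_sample` and the record layout are assumptions made for this example, not part of the notebook.

```python
import random
from collections import defaultdict

def proportional_stratified_sample(records, key, n, seed=0):
    """Draw about n records, allocating draws to strata in proportion to their size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        # Proportional allocation: stratum share of the population times n.
        # In general, rounding can make the total differ slightly from n.
        k = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy population, not the Titanic data.
population = [{"id": i, "sex": "male" if i < 640 else "female"} for i in range(1000)]
sample = proportional_stratified_sample(population, key=lambda r: r["sex"], n=50)
counts = {s: sum(r["sex"] == s for r in sample) for s in ("male", "female")}
print(counts)
```

With these toy numbers the allocation is 32 and 18, mirroring the 64/36 population split, whereas equal allocation would force 25 and 25.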

### 3. Cluster Sampling
Cluster sampling involves dividing the population into clusters (e.g., geographical areas, schools, households) and then randomly selecting entire clusters to be included in the sample. Unlike stratified sampling, where individuals are randomly sampled from each stratum, cluster sampling involves sampling entire groups or clusters. Use cluster sampling when the population is large and spread out over a wide geographic area or when it's easier to access groups or clusters rather than individual members. It can be more cost-effective and logistically feasible compared to other methods, especially in field surveys.
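A minimal standard-library sketch of one-stage cluster sampling follows. The helper `cluster_sample` and the toy population of 200 unit IDs are hypothetical; the point is only that whole clusters, not individual units, are drawn at random.

```python
import random

def cluster_sample(records, cluster_size, n_clusters, seed=0):
    """Split records into consecutive clusters and keep every unit of the chosen clusters."""
    rng = random.Random(seed)
    # Consecutive, non-overlapping clusters; the last one may be shorter
    # if len(records) is not a multiple of cluster_size.
    clusters = [records[i:i + cluster_size] for i in range(0, len(records), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [record for cluster in chosen for record in cluster]

population = list(range(200))  # hypothetical unit IDs
sample = cluster_sample(population, cluster_size=10, n_clusters=5)
print(len(sample), sorted({unit // 10 for unit in sample}))
```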

### 4. Systematic Sampling
Systematic sampling selects individuals from a population at regular intervals: choose a random starting point, then take every kth element after it. Use systematic sampling when the population is large and ordered in some manner (e.g., alphabetically, chronologically). It provides a simple and efficient way to sample from a large population without needing a complete list of all individuals beforehand.
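The random start and fixed interval can be sketched as below, using only the standard library. This is one possible variant (a fractional interval, so that exactly `n` units are picked across the whole list); the function name `systematic_sample` and the stand-in population of row positions are assumptions for illustration.

```python
import random

def systematic_sample(records, n, seed=0):
    """Pick n records at a regular (possibly fractional) interval after a random start."""
    rng = random.Random(seed)
    k = len(records) / n          # sampling interval, e.g. 891 / 50 = 17.82
    start = rng.random() * k      # random start in [0, k)
    last = len(records) - 1
    # Truncating start + i*k to an integer spreads the picks over the whole list.
    indices = [min(int(start + i * k), last) for i in range(n)]
    return [records[i] for i in indices]

population = list(range(891))  # row positions only; same length as the Titanic data
sample = systematic_sample(population, n=50)
print(sample[:5])
```

An integer interval (`len(records) // n`) is simpler, but it can yield more than `n` candidates or leave the tail of the list unreachable, which is why the fractional form is used here.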

## 2. The variables under investigation
- Percentage of:
    - Passengers in each class
    - Survivors and deceased
    - Men and women separately
- The mean value of columns:
    - Age
    - Fare

**For all the sampling methods, the sample size is 50.**
 
## 3. Conclusions on Representativeness

#### Simple Random Sampling
- **Class**: Overrepresented third class, underrepresented second class.
- **Survival**: Close to population percentages.
- **Gender**: Close to population percentages.
- **Age and Fare**: Lower mean age and fare than population.

**Simple Random Sampling** can sometimes lead to over- or under-representation in smaller samples.
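To get a feel for how much a proportion can move by chance at n = 50, the sketch below computes the usual standard error of a sample proportion, sqrt(p(1 - p)/n), for the second-class share. It takes the population share (20.65%) and the observed sample share (12.0%) from the comparison table, and it ignores the finite population correction, which is a simplifying assumption.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

p_second = 0.2065   # population share of second class (comparison table)
observed = 0.12     # second-class share in the simple random sample
n = 50

se = proportion_standard_error(p_second, n)
z = (observed - p_second) / se
print(f"SE = {se * 100:.2f} percentage points, z = {z:.2f}")
```

The standard error is a little under 6 percentage points, so the observed gap of about 8.7 points is roughly 1.5 standard errors: noticeable, but not unusual for a single sample of 50.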

---

#### Stratified Random Sampling
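- **Class**: Closely matches population distribution (second class slightly overrepresented).
- **Survival**: Higher survival rate than the population (46.0% vs. 38.38%).
- **Gender**: Exactly 50-50 by construction (25 men, 25 women), which does not match the population split (64.76% male, 35.24% female).
- **Age and Fare**: Mean values close to the population, though slightly lower.

**Stratified Random Sampling** matches the class distribution well, but because both strata were sampled in equal numbers, the gender split is fixed at 50-50 rather than reflecting the population, and the survival rate is overestimated.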
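When strata are sampled in equal numbers, a population-level estimate is usually obtained by weighting each stratum by its population share instead of averaging the strata equally. The sketch below shows the arithmetic. The population shares come from the comparison table, but the per-stratum survival rates are hypothetical placeholders (the notebook does not report them), and `weighted_estimate` is a name made up for this example.

```python
def weighted_estimate(stratum_values, stratum_weights):
    """Combine per-stratum values using weights (e.g. population shares)."""
    total_weight = sum(stratum_weights.values())
    return sum(stratum_values[s] * stratum_weights[s] for s in stratum_values) / total_weight

# Population shares from the comparison table.
population_share = {"male": 0.6476, "female": 0.3524}
# HYPOTHETICAL per-stratum survival rates, for illustration only.
sample_survival = {"male": 0.20, "female": 0.70}

# Equal allocation implicitly gives each stratum weight 1/2.
unweighted = sum(sample_survival.values()) / len(sample_survival)
weighted = weighted_estimate(sample_survival, population_share)
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

With these placeholder rates the equally weighted figure is 0.45, while the share-weighted figure is about 0.376, which shows how equal allocation can shift an overall rate when the strata differ.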

---

#### Cluster Sampling
- **Class**: Closely matches population distribution.
- **Survival**: Lower survival rate, higher deceased rate.
- **Gender**: Slightly more balanced than population.
- **Age and Fare**: Mean values very close to population.

**Cluster Sampling** provides good representativeness for class, age, and fare, but underestimates the survival rate (32.0% vs. 38.38%).

---

#### Systematic Sampling
- **Class**: Underrepresented second class, overrepresented third class.
- **Survival**: Close to population percentages.
- **Gender**: Very close to population percentages.
- **Age and Fare**: Mean age close to population; mean fare somewhat lower (28.11 vs. 32.20).

**Systematic Sampling** also provides good representativeness for most variables, with gender and survival close to the population, though the mean fare is lower than the population mean.

---

Each sampling method has its strengths, but stratified and cluster sampling tend to be more representative of the population characteristics in this case.

## 4. Comparison table

| Sampling method | Percentage of each class  | Percentage of survivors and deceased  | Percentage of men and women  | Mean age  | Mean fare  |
|--------------------------------------|---------------------------|----------------------------------------|------------------------------|--------------|---------------|
| **Population**                       | Third: 55.11%<br>First: 24.24%<br>Second: 20.65%  | Survivors: 38.38%<br>Deceased: 61.62%  | Male: 64.76%<br>Female: 35.24%  | 29.70  | 32.20  |
| **Simple Random Sampling**           | Third: 64.0%<br>First: 24.0%<br>Second: 12.0%  | Survivors: 36.0%<br>Deceased: 64.0%  | Male: 68.0%<br>Female: 32.0%  | 26.13  | 22.00  |
| **Stratified Sampling** (strata = male and female groups)              | Third: 52.0%<br>First: 24.0%<br>Second: 24.0%  | Survivors: 46.0%<br>Deceased: 54.0%  | Male: 50.0%<br>Female: 50.0%  | 29.33  | 31.57  |
| **Cluster Sampling** (passenger IDs)                | Third: 54.0%<br>First: 22.0%<br>Second: 24.0%  | Survivors: 32.0%<br>Deceased: 68.0%  | Male: 58.0%<br>Female: 42.0%  | 28.15  | 32.60  |
| **Systematic Sampling**              | Third: 60.0%<br>First: 24.0%<br>Second: 16.0%  | Survivors: 34.0%<br>Deceased: 66.0%  | Male: 64.0%<br>Female: 36.0%  | 28.16  | 28.11  |
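One simple way to condense the class columns of the table into a single number per method is the total absolute deviation from the population percentages. The sketch below uses the values from the table above; the metric itself is just one possible choice, not something the analysis prescribes.

```python
# Class percentages copied from the comparison table.
population = {"Third": 55.11, "First": 24.24, "Second": 20.65}
samples = {
    "Simple random": {"Third": 64.0, "First": 24.0, "Second": 12.0},
    "Stratified": {"Third": 52.0, "First": 24.0, "Second": 24.0},
    "Cluster": {"Third": 54.0, "First": 22.0, "Second": 24.0},
    "Systematic": {"Third": 60.0, "First": 24.0, "Second": 16.0},
}

def total_absolute_deviation(sample, reference):
    """Sum of absolute differences (in percentage points) across categories."""
    return sum(abs(sample[k] - reference[k]) for k in reference)

deviations = {
    name: round(total_absolute_deviation(s, population), 2)
    for name, s in samples.items()
}
for name, value in sorted(deviations.items(), key=lambda item: item[1]):
    print(f"{name}: {value}")
```

On this metric the stratified and cluster samples tie at 6.70 points, followed by systematic (9.78) and simple random sampling (17.78).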

In [608]:
import seaborn as sns
import pandas as pd
import random

In [609]:
### Load the Titanic dataset

titanic = sns.load_dataset("titanic")

# Population

In [610]:
### Variables for the entire population

population_size = len(titanic)

percentage_by_class = pd.DataFrame(titanic['class'].value_counts() / population_size * 100).rename(columns={'count': '%'})

survivors_percentage = titanic['survived'].sum() / population_size * 100
deceased_percentage = 100 - survivors_percentage
percentage_survivors_deceased = pd.DataFrame({"Survivors": [survivors_percentage],
                                              "Deceased": [deceased_percentage]}).T.rename(columns={0: '%'})

percentage_gender = pd.DataFrame(titanic['sex'].value_counts() / population_size * 100).rename(columns={'count': '%'})

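# mean() skips missing values; 'age' has NaNs, so its mean covers only passengers with a recorded age
mean_age = titanic['age'].mean()
mean_fare = titanic['fare'].mean()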

In [611]:
print("Percentage of passengers in each class in the entire population:")
percentage_by_class

Percentage of passengers in each class in the entire population:


| class | % |
|-------|---|
| Third | 55.106622 |
| First | 24.242424 |
| Second | 20.650954 |


In [612]:
print("Percentage of survivors and deceased in the entire population:")
percentage_survivors_deceased

Percentage of survivors and deceased in the entire population:


| | % |
|---|---|
| Survivors | 38.383838 |
| Deceased | 61.616162 |


In [613]:
print("Percentage of male and female in the entire population:")
percentage_gender

Percentage of male and female in the entire population:


| sex | % |
|-----|---|
| male | 64.758698 |
| female | 35.241302 |


In [614]:
print(f"""The mean value of:
      - age: {mean_age}
      - fare: {mean_fare}
in the entire population""")

The mean value of:
      - age: 29.69911764705882
      - fare: 32.204207968574636
in the entire population


# Simple Random Sampling

In [615]:
### Variables for simple random sampling

sample_size = 50
simple_random_sample = titanic.sample(sample_size, random_state=42)

sample_percentage_by_class = pd.DataFrame(simple_random_sample['class'].value_counts() / sample_size * 100).rename(columns={'count': '%'})

sample_percentage_survivors = (simple_random_sample['survived'].sum() / sample_size) * 100
sample_percentage_deceased = 100 - sample_percentage_survivors
sample_percentage_survivors_deceased = pd.DataFrame({"Survivors": [sample_percentage_survivors],
                                              "Deceased": [sample_percentage_deceased]}).T.rename(columns={0: '%'})

sample_percentage_gender = pd.DataFrame(simple_random_sample['sex'].value_counts() / sample_size * 100).rename(columns={'count': '%'})

sample_mean_age = simple_random_sample['age'].mean()
sample_mean_fare = simple_random_sample['fare'].mean()

In [616]:
print(f"Percentage of passengers in each class using simple random sampling (size = {sample_size}):")
sample_percentage_by_class

Percentage of passengers in each class using simple random sampling (size = 50):


| class | % |
|-------|---|
| Third | 64.0 |
| First | 24.0 |
| Second | 12.0 |


In [617]:
print(f"Percentage of survivors and deceased using simple random sampling (size = {sample_size}):")
sample_percentage_survivors_deceased

Percentage of survivors and deceased using simple random sampling (size = 50):


| | % |
|---|---|
| Survivors | 36.0 |
| Deceased | 64.0 |


In [618]:
print(f"Percentage of male and female using simple random sampling (size = {sample_size}):")
sample_percentage_gender

Percentage of male and female using simple random sampling (size = 50):


| sex | % |
|-----|---|
| male | 68.0 |
| female | 32.0 |


In [619]:
print(f"""The mean value of:
      - age: {sample_mean_age}
      - fare: {sample_mean_fare}
using simple random sampling (size = {sample_size})""")

The mean value of:
      - age: 26.128205128205128
      - fare: 21.999916
using simple random sampling (size = 50)


#### Simple Random Sampling
- **Class**: Overrepresented third class, underrepresented second class.
- **Survival**: Close to population percentages.
- **Gender**: Close to population percentages.
- **Age and Fare**: Lower mean age and fare than population.

# Stratified Random Sampling
Here the strata are the male and female groups, with an equal number of passengers (25) drawn from each.

In [620]:
### Variables for stratified random sampling

strata = titanic['sex'].unique()

stratified_sample = pd.DataFrame()
my_sample_size = 25 # 25 for female and 25 for male
for group in strata:
    stratum = titanic[titanic['sex'] == group]
    if len(stratum) >= my_sample_size:
        stratified_sample = pd.concat([stratified_sample, stratum.sample(n=my_sample_size, random_state=42)])

stratified_percentage_by_class = pd.DataFrame(stratified_sample['class'].value_counts() / stratified_sample.shape[0] * 100).rename(columns={'count': '%'})

stratified_percentage_survivors = (stratified_sample['survived'].sum() / stratified_sample.shape[0]) * 100
stratified_percentage_deceased = 100 - stratified_percentage_survivors
stratified_percentage_survivors_deceased = pd.DataFrame({"Survivors": [stratified_percentage_survivors],
                                              "Deceased": [stratified_percentage_deceased]}).T.rename(columns={0: '%'})

stratified_percentage_gender = pd.DataFrame(stratified_sample['sex'].value_counts() / stratified_sample.shape[0] * 100).rename(columns={'count': '%'})

stratified_mean_age = stratified_sample['age'].mean()
stratified_mean_fare = stratified_sample['fare'].mean()

In [621]:
print(f"Percentage of passengers in each class using stratified random sampling (size = {sample_size}):")
stratified_percentage_by_class

Percentage of passengers in each class using stratified random sampling (size = 50):


| class | % |
|-------|---|
| Third | 52.0 |
| First | 24.0 |
| Second | 24.0 |


In [622]:
print(f"Percentage of survivors and deceased using stratified random sampling (size = {sample_size}):")
stratified_percentage_survivors_deceased

Percentage of survivors and deceased using stratified random sampling (size = 50):


| | % |
|---|---|
| Survivors | 46.0 |
| Deceased | 54.0 |


In [623]:
print(f"Percentage of male and female using stratified random sampling (size = {sample_size}):")
stratified_percentage_gender

Percentage of male and female using stratified random sampling (size = 50):


| sex | % |
|-----|---|
| male | 50.0 |
| female | 50.0 |


In [624]:
print(f"""The mean value of:
      - age: {stratified_mean_age}
      - fare: {stratified_mean_fare}
using stratified random sampling (size = {sample_size})""")

The mean value of:
      - age: 29.329268292682926
      - fare: 31.571002
using stratified random sampling (size = 50)


#### Stratified Random Sampling
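- **Class**: Closely matches population distribution (second class slightly overrepresented).
- **Survival**: Higher survival rate than the population (46.0% vs. 38.38%).
- **Gender**: Exactly 50-50 by construction (25 men, 25 women), which does not match the population split (64.76% male, 35.24% female).
- **Age and Fare**: Mean values close to the population, though slightly lower.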

# Cluster Sampling
Clusters are blocks of 10 consecutive rows; row order stands in for passenger ID.

In [625]:
### Variables for cluster sampling

random.seed(42)

cluster_size = 10  # Number of passengers in each cluster
# 891 // 10 = 89 full clusters; the one leftover row (position 890) belongs to no cluster
total_clusters = len(titanic) // cluster_size
cluster_ids = list(range(total_clusters))

# Sample 5 clusters, each containing 10 passengers
cluster_sample = pd.DataFrame()
sample_clusters = random.sample(cluster_ids, 5)
for cluster_id in sample_clusters:
    cluster = titanic.iloc[cluster_id * cluster_size : (cluster_id + 1) * cluster_size]
    cluster_sample = pd.concat([cluster_sample, cluster])

cluster_percentage_by_class = pd.DataFrame(cluster_sample['class'].value_counts() / sample_size * 100).rename(columns={'count': '%'})

cluster_percentage_survivors = (cluster_sample['survived'].sum() / sample_size) * 100
cluster_percentage_deceased = 100 - cluster_percentage_survivors
cluster_percentage_survivors_deceased = pd.DataFrame({"Survivors": [cluster_percentage_survivors],
                                              "Deceased": [cluster_percentage_deceased]}).T.rename(columns={0: '%'})

cluster_percentage_gender = pd.DataFrame(cluster_sample['sex'].value_counts() / sample_size * 100).rename(columns={'count': '%'})

cluster_mean_age = cluster_sample['age'].mean()
cluster_mean_fare = cluster_sample['fare'].mean()

In [626]:
print(f"Percentage of passengers in each class using cluster sampling (size = {sample_size}):")
cluster_percentage_by_class

Percentage of passengers in each class using cluster sampling (size = 50):


| class | % |
|-------|---|
| Third | 54.0 |
| Second | 24.0 |
| First | 22.0 |


In [627]:
print(f"Percentage of survivors and deceased using cluster sampling (size = {sample_size}):")
cluster_percentage_survivors_deceased

Percentage of survivors and deceased using cluster sampling (size = 50):


| | % |
|---|---|
| Survivors | 32.0 |
| Deceased | 68.0 |


In [628]:
print(f"Percentage of male and female using cluster sampling (size = {sample_size}):")
cluster_percentage_gender

Percentage of male and female using cluster sampling (size = 50):


| sex | % |
|-----|---|
| male | 58.0 |
| female | 42.0 |


In [629]:
print(f"""The mean value of:
      - age: {cluster_mean_age}
      - fare: {cluster_mean_fare}
using cluster sampling (size = {sample_size})""")

The mean value of:
      - age: 28.146341463414632
      - fare: 32.59958400000001
using cluster sampling (size = 50)


#### Cluster Sampling
- **Class**: Closely matches population distribution.
- **Survival**: Lower survival rate, higher deceased rate.
- **Gender**: Slightly more balanced than population.
- **Age and Fare**: Mean values very close to population.

# Systematic Sampling

In [630]:
### Variables for systematic sampling

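# Interval k = 891 // 50 = 17
sampling_interval = len(titanic) // sample_size
# Rows 0, 17, 34, ... give 53 candidates (the start is fixed at row 0, not random);
# .sample() then randomly keeps 50 of those 53 to reach the target size
systematic_sample = titanic.iloc[::sampling_interval].sample(sample_size, random_state=42)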

systematic_percentage_by_class = pd.DataFrame(systematic_sample['class'].value_counts() / sample_size * 100).rename(columns={'count': '%'})

systematic_percentage_survivors = (systematic_sample['survived'].sum() / sample_size) * 100
systematic_percentage_deceased = 100 - systematic_percentage_survivors
systematic_percentage_survivors_deceased = pd.DataFrame({"Survivors": [systematic_percentage_survivors],
                                              "Deceased": [systematic_percentage_deceased]}).T.rename(columns={0: '%'})

systematic_percentage_gender = pd.DataFrame(systematic_sample['sex'].value_counts() / sample_size * 100).rename(columns={'count': '%'})

systematic_mean_age = systematic_sample['age'].mean()
systematic_mean_fare = systematic_sample['fare'].mean()

In [631]:
print(f"Percentage of passengers in each class using systematic sampling (size = {sample_size}):")
systematic_percentage_by_class

Percentage of passengers in each class using systematic sampling (size = 50):


| class | % |
|-------|---|
| Third | 60.0 |
| First | 24.0 |
| Second | 16.0 |


In [632]:
print(f"Percentage of survivors and deceased using systematic sampling (size = {sample_size}):")
systematic_percentage_survivors_deceased

Percentage of survivors and deceased using systematic sampling (size = 50):


| | % |
|---|---|
| Survivors | 34.0 |
| Deceased | 66.0 |


In [633]:
print(f"Percentage of male and female using systematic sampling (size = {sample_size}):")
systematic_percentage_gender

Percentage of male and female using systematic sampling (size = 50):


| sex | % |
|-----|---|
| male | 64.0 |
| female | 36.0 |


In [634]:
print(f"""The mean value of:
      - age: {systematic_mean_age}
      - fare: {systematic_mean_fare}
using systematic random sampling (size = {sample_size})""")

The mean value of:
      - age: 28.1625
      - fare: 28.108078
using systematic random sampling (size = 50)


#### Systematic Sampling
- **Class**: Underrepresented second class, overrepresented third class.
- **Survival**: Close to population percentages.
- **Gender**: Very close to population percentages.
- **Age and Fare**: Mean age close to population; mean fare somewhat lower (28.11 vs. 32.20).