# Exploring Statistical Concepts and Sampling Techniques with the Iris Dataset

In this notebook, we'll explore some fundamental statistical concepts such as mean, median, mode, range, and variance using the Iris dataset. Additionally, we'll demonstrate four different sampling techniques: simple random sampling, systematic sampling, stratified sampling, and cluster sampling.

## 1. Loading and Exploring the Iris Dataset

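The Iris dataset consists of 150 samples of iris flowers (50 per species), each with four features: sepal length, sepal width, petal length, and petal width. It is commonly used to classify flowers into one of three species: setosa, versicolor, or virginica. In the code below, the `species` column holds the integer target code (0 = setosa, 1 = versicolor, 2 = virginica) rather than the name.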

In [None]:
# Import the libraries used in this notebook
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Display the first few rows of the dataset
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
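
The integer codes in the `species` column can be mapped back to readable names. The following is a minimal sketch, assuming the code order that `load_iris` uses (0 = setosa, 1 = versicolor, 2 = virginica). The `codes` array is a stand-in for `iris.target`, and `species_name` is a hypothetical column name chosen for illustration.

```python
import numpy as np
import pandas as pd

# Species names in the order of the integer codes 0, 1, 2.
target_names = np.array(["setosa", "versicolor", "virginica"])

# Stand-in for iris.target: 50 rows per species, in sorted order.
codes = np.repeat([0, 1, 2], 50)

# Indexing the name array with the codes yields one label per row.
species_names = pd.Series(target_names[codes], name="species_name")

print(species_names.value_counts().sort_index())
```

In the notebook itself, the same idea would be `iris.target_names[iris.target]`, which avoids hard-coding the names.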


## 2. Statistical Concepts

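Let's calculate the mean, median, mode, range, and variance for each numeric column in the Iris dataset. Because `species` is stored as an integer code, it appears in the output alongside the four features, but its mean, range, and variance have no meaningful interpretation; only its mode (the most frequent class) and its counts are informative.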

In [None]:
# Calculate mean, median, and mode for each feature column
mean_values = df.mean()
median_values = df.median()
mode_values = df.mode().iloc[0]

print('Mean values:\n', mean_values)
print('\nMedian values:\n', median_values)
print('\nMode values:\n', mode_values)

Mean values:
 sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
species              1.000000
dtype: float64

Median values:
 sepal length (cm)    5.80
sepal width (cm)     3.00
petal length (cm)    4.35
petal width (cm)     1.30
species              1.00
dtype: float64

Mode values:
 sepal length (cm)    5.0
sepal width (cm)     3.0
petal length (cm)    1.4
petal width (cm)     0.2
species              0.0
Name: 0, dtype: float64


In [None]:
# Calculate range and variance for each feature column
range_values = df.max() - df.min()
variance_values = df.var()

print('\nRange values:\n', range_values)
print('\nVariance values:\n', variance_values)


Range values:
 sepal length (cm)    3.6
sepal width (cm)     2.4
petal length (cm)    5.9
petal width (cm)     2.4
species              2.0
dtype: float64

Variance values:
 sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
species              0.671141
dtype: float64
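
To make the definitions concrete, here is a small sketch that computes the same five statistics by hand for the petal lengths of the first five rows shown earlier, using only Python's standard `statistics` module. It also contrasts the sample variance (divides by n - 1, which is what pandas' `var()` uses by default) with the population variance (divides by n, which is NumPy's default for `np.var`). The variable names are illustrative.

```python
import statistics

# Petal lengths (cm) of the first five rows of the dataset.
petal_lengths = [1.4, 1.4, 1.3, 1.5, 1.4]

mean_value = statistics.mean(petal_lengths)      # sum of values / count
median_value = statistics.median(petal_lengths)  # middle value after sorting
mode_value = statistics.mode(petal_lengths)      # most frequent value
range_value = max(petal_lengths) - min(petal_lengths)

# Sample variance: squared deviations from the mean, divided by n - 1.
sample_variance = statistics.variance(petal_lengths)
# Population variance: the same sum, divided by n.
population_variance = statistics.pvariance(petal_lengths)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("range:", range_value)
print("sample variance:", sample_variance)
print("population variance:", population_variance)
```

For these five values the squared deviations sum to about 0.02, so the sample variance is about 0.005 and the population variance about 0.004. The gap shrinks as the number of rows grows.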


## 3. Sampling Techniques

We'll demonstrate the following sampling techniques:

1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Sampling
4. Cluster Sampling

### 3.1 Simple Random Sampling

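Simple random sampling is the most basic sampling technique: every row in the population has the same probability of being chosen, and every possible sample of a given size is equally likely. Here we draw 30 of the 150 rows without replacement, so the class counts in the sample can differ from the population's 1:1:1 ratio purely by chance.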

In [None]:
# Simple Random Sampling
simple_random_sample = df.sample(n=30, random_state=2024)
print('Simple Random Sampling:\n', simple_random_sample['species'].value_counts())

Simple Random Sampling:
 species
0    12
2    10
1     8
Name: count, dtype: int64
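
The mechanism behind a simple random sample can be sketched with NumPy alone: draw row positions uniformly at random without replacement. This is an illustration under the assumption of a 150-row population, not a description of how `DataFrame.sample` is implemented internally; the names `population_size` and `sampled_positions` are hypothetical.

```python
import numpy as np

population_size = 150  # number of rows in the Iris frame
sample_size = 30

# Seeded generator so the draw is reproducible.
rng = np.random.default_rng(2024)

# replace=False draws without replacement, so no row is picked twice.
sampled_positions = rng.choice(population_size, size=sample_size, replace=False)

# Each row has the same inclusion probability: n / N.
inclusion_probability = sample_size / population_size

print(np.sort(sampled_positions))
print("inclusion probability:", inclusion_probability)
```

In the notebook, `df.iloc[sampled_positions]` would then select the corresponding rows.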


### 3.2 Systematic Sampling

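Systematic sampling selects every k-th element from an ordered population, ideally starting from a randomly chosen position within the first interval. The code below uses an interval of k = 150 // 15 = 10 and starts at row 0, which yields 15 rows. Because the Iris rows are ordered by species, this happens to pick exactly 5 rows per species.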

In [None]:
# Systematic Sampling
sample_size = 15
interval = len(df) // sample_size  # k = 10: take every 10th row
systematic_sample = df.iloc[::interval]  # starts at row 0 (no random start)
print('Systematic Sampling:\n', systematic_sample['species'].value_counts())

Systematic Sampling:
 species
0    5
1    5
2    5
Name: count, dtype: int64
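
A textbook systematic sample usually begins at a random offset inside the first interval rather than always at row 0. The sketch below shows that variant on row positions only, assuming a 150-row population and a target of 15 rows; the names `start` and `positions` are illustrative.

```python
import numpy as np

population_size = 150
sample_size = 15
interval = population_size // sample_size  # k = 10

# Seeded generator so the random start is reproducible.
rng = np.random.default_rng(2024)

# Random start in [0, k): every row belongs to exactly one of the k possible samples.
start = int(rng.integers(0, interval))

# Take every k-th position from the random start onward.
positions = np.arange(start, population_size, interval)

print("start:", start)
print("positions:", positions)
```

In the notebook, `df.iloc[positions]` would select the rows. Note that systematic sampling can be biased when the ordering of the data has a period that matches the interval.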


### 3.3 Stratified Sampling

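Stratified sampling divides the population into strata (non-overlapping subgroups, here the three species) and draws a random sample from each stratum, typically in proportion to the stratum's size. This guarantees that the sample reproduces the population's class distribution.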

In [None]:
# Stratified Sampling
from sklearn.model_selection import train_test_split

# Keep 20% of the rows (test_size=0.8 discards the other 80%);
# stratify=df['species'] preserves the class distribution in the kept part
stratified_sample, _ = train_test_split(df, test_size=0.8, stratify=df['species'], random_state=2024)
print('Stratified Sampling:\n', stratified_sample['species'].value_counts())

Stratified Sampling:
 species
2    10
0    10
1    10
Name: count, dtype: int64
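
An alternative that states the intent more directly is to sample within each group using pandas. The sketch below assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available, and uses a stand-in frame with the same species layout as the Iris data; `row_id` is a hypothetical column added only so the rows are distinguishable.

```python
import numpy as np
import pandas as pd

# Stand-in frame: 50 rows per species, like the Iris data.
frame = pd.DataFrame({
    "species": np.repeat([0, 1, 2], 50),
    "row_id": np.arange(150),
})

# Draw 20% of each stratum independently (proportional allocation).
stratified = frame.groupby("species").sample(frac=0.2, random_state=2024)

print(stratified["species"].value_counts().sort_index())
```

Replacing `frac=0.2` with `n=10` would request a fixed number of rows per stratum instead, which matters when the strata have unequal sizes.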


### 3.4 Cluster Sampling

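Cluster sampling divides the population into clusters (groups), selects a random sample of clusters, and then includes every member of each selected cluster. It works best when each cluster resembles the population as a whole. Treating species as clusters, as done below, is only a demonstration: the resulting sample contains a single species and is not representative of the dataset.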

In [None]:
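# Cluster Sampling
# For demonstration purposes, we'll treat each species as one cluster.
# np.random.choice is unseeded here, so the selected species can change between runs.
selected_clusters = np.random.choice(df['species'].unique(), size=1)
cluster_sample = df[df['species'].isin(selected_clusters)]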
print('Cluster Sampling:\n', cluster_sample['species'].value_counts())

Cluster Sampling:
 species
2    50
Name: count, dtype: int64
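
For comparison, here is a sketch of cluster sampling with clusters that each mix several species. The clusters are entirely hypothetical (for example, 15 collection sites of 10 flowers each) and are created by randomly assigning rows of a stand-in frame; the Iris dataset itself contains no such grouping.

```python
import numpy as np
import pandas as pd

# Seeded generator so the assignment and the draw are reproducible.
rng = np.random.default_rng(2024)

# Stand-in frame: 50 rows per species, like the Iris data.
frame = pd.DataFrame({"species": np.repeat([0, 1, 2], 50)})

# Hypothetical clusters: a random permutation of 0..149, integer-divided by 10,
# assigns each row to one of 15 clusters with exactly 10 rows each.
frame["cluster"] = rng.permutation(len(frame)) // 10

# Select 3 of the 15 clusters at random, then keep every row in them.
chosen_clusters = rng.choice(frame["cluster"].unique(), size=3, replace=False)
cluster_sample = frame[frame["cluster"].isin(chosen_clusters)]

print("chosen clusters:", np.sort(chosen_clusters))
print(cluster_sample["species"].value_counts().sort_index())
```

Because whole clusters are kept, the sample size is fixed by the cluster sizes (3 x 10 = 30 rows here), while the species mix depends on which clusters were drawn.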

