## Before you start, please consider:
- This code was transcribed from the page: https://towardsdatascience.com/probability-sampling-with-python-8c977ad78664. Credit for the sampling code goes to the author: Roberto Salazar - Industrial and Systems Engineer | LinkedIn: linkedin.com/in/roberto-salazar-reyna/
- However, the code was applied here to the data of a different project.
- This exercise was done to clear up doubts about sampling while simulating a real project.
- You will see many questions throughout this notebook, as I'm not sure whether I'm doing everything correctly.
- Please check the questions and, if possible, answer them; it will help everyone who is starting out with sampling.

### Other Notes: 
- The first part of this document pre-processes the data before the sampling section. It finishes when the chosen sample is ready to be split into Train and Test.
- My advice is not to drop columns too soon, since we will be sampling later. As a beginner, I've realized that it's best to drop columns during the analysis process rather than beforehand on the assumption that a column won't matter.

##  >>> Let's Start <<<

## Insurance Premium Dataset

### The dataset contains the following information about 79853 policy holders: 
1.	id: Unique customer ID
2.	perc_premium_paid_by_cash_credit: What % of the premium was paid by cash payments?
3.	age_in_days: Age of the customer in days 
4.	Income: Income of the customer 
5.	Marital Status: Married (1), Unmarried (0)
6.	Veh_Owned: Number of vehicles owned (1-3)
7.	Count_3-6_months_late: Number of times premium was paid 3-6 months late 
8.	Count_6-12_months_late: Number of times premium was paid 6-12 months late 
9.	Count_more_than_12_months_late: Number of times premium was paid more than 12 months late 
10.	risk_score: Risk score of customer (similar to credit score)
11.	No_of_dep: Number of dependents in the family of the customer (1-4) 
12.	Accomodation: Owned (1), Rented (0)
13.	no_of_premiums_paid: Number of premiums paid till date 
14.	sourcing_channel: Channel through which customer was sourced 
15.	residence_area_type: Residence type of the customer
16.	premium : Total premium amount paid till now
17.	default: Y variable - 0 indicates that customer has defaulted the premium and 1 indicates that customer has not defaulted the premium


In [62]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [63]:
#Loading dataset
data = pd.read_excel('Insurance Premium Default-Dataset.xlsx')

In [64]:
df=data.copy()

### Checking Dataset

In [65]:
df.shape

(79853, 17)

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79853 entries, 0 to 79852
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                79853 non-null  int64  
 1   perc_premium_paid_by_cash_credit  79853 non-null  float64
 2   age_in_days                       79853 non-null  int64  
 3   Income                            79853 non-null  int64  
 4   Count_3-6_months_late             79853 non-null  int64  
 5   Count_6-12_months_late            79853 non-null  int64  
 6   Count_more_than_12_months_late    79853 non-null  int64  
 7   Marital Status                    79853 non-null  int64  
 8   Veh_Owned                         79853 non-null  int64  
 9   No_of_dep                         79853 non-null  int64  
 10  Accomodation                      79853 non-null  int64  
 11  risk_score                        79853 non-null  float64
 12  no_o

### Fixing Data Type

In [67]:
cols = df.select_dtypes(['object'])
cols.columns

Index(['sourcing_channel', 'residence_area_type'], dtype='object')

In [68]:
for i in cols.columns:
    df[i] = df[i].astype('category')

In [69]:
df.columns

Index(['id', 'perc_premium_paid_by_cash_credit', 'age_in_days', 'Income',
       'Count_3-6_months_late', 'Count_6-12_months_late',
       'Count_more_than_12_months_late', 'Marital Status', 'Veh_Owned',
       'No_of_dep', 'Accomodation', 'risk_score', 'no_of_premiums_paid',
       'sourcing_channel', 'residence_area_type', 'premium', 'default'],
      dtype='object')

### Checking Missing Values

In [70]:
df.isna().sum()

id                                  0
perc_premium_paid_by_cash_credit    0
age_in_days                         0
Income                              0
Count_3-6_months_late               0
Count_6-12_months_late              0
Count_more_than_12_months_late      0
Marital Status                      0
Veh_Owned                           0
No_of_dep                           0
Accomodation                        0
risk_score                          0
no_of_premiums_paid                 0
sourcing_channel                    0
residence_area_type                 0
premium                             0
default                             0
dtype: int64

### Let's better organize the information about Age, Risk Score and Income
    - age_in_days: Convert to age in years and group the clients into 4 age ranges.
    - New columns to be created: client_age and agerange
    - Drop the unnecessary column at the end
    ****
    - risk_score: Group the clients' scores into 5 risk ranges.
    - New column to be created: risk_range
    - Drop the unnecessary column at the end
    ****
    - Income: Group into income ranges using "qcut", which splits the column into a chosen number of quantiles.
    - We will try two settings, q = 4 and q = 10, and pick the better column at the end. The reason is that the gap between the smallest and the largest income is very large.

> Treating the column about AGE

In [71]:
df['client_age'] = df['age_in_days'] /365

In [72]:
df.client_age

0        31.041096
1        83.038356
2        44.024658
3        65.021918
4        53.041096
           ...    
79848    70.013699
79849    46.019178
79850    68.041096
79851    30.024658
79852    54.027397
Name: client_age, Length: 79853, dtype: float64

In [73]:
# Bin the ages into ranges; any value outside the 21-100 edges is left as NaN
bins = [21, 40, 60, 80, 100]
labels = ['21-40', '41-60', '61-80', '80+']
df['agerange'] = pd.cut(df.client_age, bins, labels = labels,include_lowest = True)

In [74]:
df.agerange

0        21-40
1          80+
2        41-60
3        61-80
4        41-60
         ...  
79848    61-80
79849    41-60
79850    61-80
79851    21-40
79852    41-60
Name: agerange, Length: 79853, dtype: category
Categories (4, object): ['21-40' < '41-60' < '61-80' < '80+']

In [75]:
## Dropping columns which are not adding any information.
df.drop(['age_in_days'],axis=1,inplace=True)

In [76]:
print(df['agerange'].value_counts())

41-60    38556
61-80    21347
21-40    17490
80+       2455
Name: agerange, dtype: int64
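
The four counts above add up to 79,848, which is 5 fewer than the 79,853 rows in the data, so a few ages probably fell outside the 21-100 bin edges and were left as NaN by `pd.cut`. Below is a minimal sketch of how this could be checked. The ages are made up, and the open-ended `np.inf` edge is only a suggestion of mine, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one below 21 and one above 100
ages = pd.Series([20.5, 21.0, 40.0, 40.04, 83.0, 101.2])
labels = ['21-40', '41-60', '61-80', '80+']

# Same edges as in the notebook: values outside [21, 100] get no bin (NaN)
closed = pd.cut(ages, [21, 40, 60, 80, 100], labels=labels, include_lowest=True)
print(closed.isna().sum())  # how many ages were left without a range

# Possible alternative: an open-ended last edge, so very old clients land in '80+'
open_ended = pd.cut(ages, [21, 40, 60, 80, np.inf], labels=labels, include_lowest=True)
print(open_ended.value_counts(dropna=False))
```

On the real data, `df['agerange'].isna().sum()` would show how many rows are affected.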


> Treating the column about Risk Score

In [77]:
df.risk_score

0        98.810
1        99.066
2        99.170
3        99.370
4        98.800
          ...  
79848    99.080
79849    99.650
79850    99.660
79851    99.460
79852    99.800
Name: risk_score, Length: 79853, dtype: float64

In [78]:
# Bin the risk scores into 5 ranges
bins = [91, 94, 96, 98,99,100]
labels = ['91-93.99', '94-95.99', '96-97.99', '98-98.9','99-100']
df['risk_range'] = pd.cut(df.risk_score, bins, labels = labels,include_lowest = True)

In [79]:
print(df['risk_range'].value_counts())

99-100      52584
98-98.9     22245
96-97.99     4386
94-95.99      478
91-93.99      160
Name: risk_range, dtype: int64


> Treating the column Income

In [80]:
df['Income_Q4'] = pd.qcut(df['Income'], q=4)
df['Income_Q10'] = pd.qcut(df['Income'], q=10)
df.head(10)

Unnamed: 0,id,perc_premium_paid_by_cash_credit,Income,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,Marital Status,Veh_Owned,No_of_dep,Accomodation,...,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,default,client_age,agerange,risk_range,Income_Q4,Income_Q10
0,1,0.317,90050,0,0,0,0,3,3,1,...,8,A,Rural,5400,1,31.041096,21-40,98-98.9,"(24029.999, 108010.0]","(71200.0, 96110.0]"
1,2,0.0,156080,0,0,0,1,3,1,1,...,3,A,Urban,11700,1,83.038356,80+,99-100,"(108010.0, 166560.0]","(142876.0, 166560.0]"
2,3,0.015,145020,1,0,0,0,1,1,1,...,14,C,Urban,18000,1,44.024658,41-60,99-100,"(108010.0, 166560.0]","(142876.0, 166560.0]"
3,4,0.0,187560,0,0,0,1,1,1,0,...,13,A,Urban,13800,1,65.021918,61-80,99-100,"(166560.0, 252090.0]","(166560.0, 195120.0]"
4,5,0.888,103050,7,3,4,0,2,1,0,...,15,A,Urban,7500,0,53.041096,41-60,98-98.9,"(24029.999, 108010.0]","(96110.0, 120080.0]"
5,6,0.512,113500,0,0,0,0,1,4,0,...,4,B,Rural,3300,1,46.013699,41-60,99-100,"(108010.0, 166560.0]","(96110.0, 120080.0]"
6,7,0.0,276240,0,0,0,0,3,4,1,...,8,C,Rural,20100,1,45.013699,41-60,99-100,"(252090.0, 90262600.0]","(231150.0, 279030.0]"
7,8,0.994,84090,0,0,0,0,3,2,0,...,4,A,Urban,3300,1,39.035616,21-40,98-98.9,"(24029.999, 108010.0]","(71200.0, 96110.0]"
8,9,0.019,138330,0,0,0,1,2,4,1,...,8,A,Urban,5400,1,76.038356,61-80,99-100,"(108010.0, 166560.0]","(120080.0, 142876.0]"
9,10,0.018,180100,0,0,0,1,3,3,1,...,8,A,Rural,9600,1,82.027397,80+,99-100,"(166560.0, 252090.0]","(166560.0, 195120.0]"
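
To see why `qcut` is used for Income instead of `cut`, here is a small sketch with made-up incomes (only the smallest and largest values are borrowed from the interval edges shown above). As far as I understand it, `pd.cut` makes bins of equal width, so one huge income pushes almost everyone into the first bin, while `pd.qcut` makes bins with roughly equal numbers of clients.

```python
import pandas as pd

# Hypothetical incomes with one extreme value, similar in spirit to the real column
income = pd.Series([24030, 50000, 90000, 110000, 150000, 170000, 250000, 90262600])

# Equal-width bins: the extreme value stretches the range
equal_width = pd.cut(income, bins=4)

# Quantile-based bins: each bin holds about the same number of rows
equal_count = pd.qcut(income, q=4)

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```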


### Removal of unwanted variables
- After the EDA with the new groups created, we can safely drop the columns that no longer add much value. We can also consider the outliers handled, since we chose to group the Income column into ranges rather than exclude part of the population.

In [81]:
## Dropping unnecessary columns
df.drop(['risk_score'],axis=1,inplace=True)
df.drop(['Income'],axis=1,inplace=True)

## Sampling Methods

- Sampling is the process of selecting a subset of units from a known population (here, the dataset).
- Probability sampling: every unit in the population has a known, non-zero chance of being selected (an equal chance in the case of simple random sampling). This family of techniques includes simple random sampling, systematic sampling, cluster sampling and stratified random sampling.

### Let's prepare the scenario so we can compare each technique at the end. First, let's create a measure column and store the real mean of the dataset.

In [82]:
# Check the number of rows in the data (first make sure there are no missing values)
index = df.index
number_of_rows = len(index)
print(number_of_rows)

79853


In [83]:
#Create a column to use as Measure during the Sampling Process. You will need this to compare each Sampling Technique
measure = np.round(np.random.normal(loc=10, scale=0.5, size = number_of_rows),3)
df['measure'] = measure

In [84]:
# Store the real mean in a separate variable
real_mean = round(df['measure'].mean(),3)

In [85]:
#Making sure all columns we need are there. 
df.columns

Index(['id', 'perc_premium_paid_by_cash_credit', 'Count_3-6_months_late',
       'Count_6-12_months_late', 'Count_more_than_12_months_late',
       'Marital Status', 'Veh_Owned', 'No_of_dep', 'Accomodation',
       'no_of_premiums_paid', 'sourcing_channel', 'residence_area_type',
       'premium', 'default', 'client_age', 'agerange', 'risk_range',
       'Income_Q4', 'Income_Q10', 'measure'],
      dtype='object')

## Questions: 
- As you can see, this approach creates the value 'measure' to compare the methods.
- Is the value 'size' the number of rows?
- Is the calculation of real_mean correct?
- Are there any changes you would suggest?
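
A small sketch that may help with these questions. As far as I can tell, `size` is simply how many random values to draw, so using the number of rows gives one value per row. The seeded generator and the row count of 10,000 are my own assumptions for the illustration; in the notebook the size would be `len(df)`.

```python
import numpy as np

n_rows = 10_000  # hypothetical row count; in the notebook this would be len(df)

# A seeded generator makes the random column the same on every run
rng = np.random.default_rng(42)

# 'size' is the number of values to draw: one per row
measure = np.round(rng.normal(loc=10, scale=0.5, size=n_rows), 3)

# The "real" mean is the mean over all rows of the population
real_mean = round(float(measure.mean()), 3)
print(len(measure), real_mean)
```

Without a seed, the measure column (and therefore the real mean) changes every time the cell is run.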

## Simple Random Sampling

In [108]:
# Obtain simple random sample
simple_random_sample = df.sample(n=10000).sort_values(by='id') #here we will request a sample size of 10,000

# Save the sample mean in a separate variable
simple_random_mean = round(simple_random_sample['measure'].mean(),3)

# View sampled data frame
simple_random_sample.head(5)

Unnamed: 0,id,perc_premium_paid_by_cash_credit,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,Marital Status,Veh_Owned,No_of_dep,Accomodation,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,default,client_age,agerange,risk_range,Income_Q4,Income_Q10,measure
1,2,0.0,0,0,0,1,3,1,1,3,A,Urban,11700,1,83.038356,80+,99-100,"(108010.0, 166560.0]","(142876.0, 166560.0]",10.445
10,11,0.5,0,0,0,1,3,1,1,16,A,Rural,13800,1,28.030137,21-40,98-98.9,"(166560.0, 252090.0]","(195120.0, 231150.0]",9.769
20,21,1.0,0,0,1,0,3,4,1,4,C,Urban,11700,1,40.035616,41-60,99-100,"(166560.0, 252090.0]","(195120.0, 231150.0]",9.134
24,25,0.0,0,0,0,0,2,3,1,15,A,Urban,32700,1,60.038356,61-80,99-100,"(252090.0, 90262600.0]","(357414.0, 90262600.0]",10.427
25,26,0.057,0,0,0,1,2,4,0,9,A,Urban,13800,1,56.030137,41-60,98-98.9,"(166560.0, 252090.0]","(166560.0, 195120.0]",10.247


## Questions: 
- Is n the number of rows I want in the sample? (In this case I asked for a selection of 10k.)
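
A minimal sketch on a made-up frame, assuming I read the pandas documentation correctly: `n` in `DataFrame.sample` is the number of rows to return, rows are drawn without replacement by default, and `random_state` makes the draw repeatable.

```python
import pandas as pd

# Hypothetical population of 100 customers
toy = pd.DataFrame({'id': range(1, 101)})

# n is the number of rows to draw; the same random_state gives the same rows again
sample_a = toy.sample(n=10, random_state=1)
sample_b = toy.sample(n=10, random_state=1)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```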

## Systematic Sampling

In [109]:
# Define systematic sampling function
def systematic_sampling(df, step):
    
    indexes = np.arange(0,len(df),step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
    
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, 3)

# Save the sample mean in a separate variable
systematic_mean = round(systematic_sample['measure'].mean(),3)

# View sampled data frame
systematic_sample.head(5)

Unnamed: 0,id,perc_premium_paid_by_cash_credit,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,Marital Status,Veh_Owned,No_of_dep,Accomodation,no_of_premiums_paid,sourcing_channel,residence_area_type,premium,default,client_age,agerange,risk_range,Income_Q4,Income_Q10,measure
0,1,0.317,0,0,0,0,3,3,1,8,A,Rural,5400,1,31.041096,21-40,98-98.9,"(24029.999, 108010.0]","(71200.0, 96110.0]",10.971
3,4,0.0,0,0,0,1,1,1,0,13,A,Urban,13800,1,65.021918,61-80,99-100,"(166560.0, 252090.0]","(166560.0, 195120.0]",9.017
6,7,0.0,0,0,0,0,3,4,1,8,C,Rural,20100,1,45.013699,41-60,99-100,"(252090.0, 90262600.0]","(231150.0, 279030.0]",9.897
9,10,0.018,0,0,0,1,3,3,1,8,A,Rural,9600,1,82.027397,80+,99-100,"(166560.0, 252090.0]","(166560.0, 195120.0]",9.577
12,13,0.015,0,0,0,1,3,1,0,6,D,Rural,13800,1,58.016438,41-60,99-100,"(166560.0, 252090.0]","(195120.0, 231150.0]",10.016


## Questions: 
- What is step in the lines "def systematic_sampling(df, step)" and "indexes = np.arange(0,len(df),step=step)"?
- Does step have a default value, or can I pass another value where it says "step = step"? Can I use step = something else?
- How can I be sure I'm taking a 10k sample?
- How can I choose my sample size when it comes to systematic sampling?
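
My current understanding is that `step` is the sampling interval: with `step=3` the function keeps every third row, which gives roughly 26,600 rows here and not 10k. One possible way to aim for a given sample size is to derive the step from it. The helper below is a hypothetical sketch (its name, the random start and the toy frame are my assumptions), not the original author's function.

```python
import numpy as np
import pandas as pd

def systematic_sample_of_size(df, n, seed=None):
    """Hypothetical helper: pick n rows at a fixed interval from a random start."""
    if n <= 0 or n > len(df):
        raise ValueError("n must be between 1 and len(df)")
    # Interval between selected rows, rounded down
    step = len(df) // n
    # Random starting position inside the first interval
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))
    # Exactly n positions: start, start + step, start + 2*step, ...
    positions = start + step * np.arange(n)
    return df.iloc[positions]

# Toy frame with the same number of rows as the insurance data
toy = pd.DataFrame({'id': np.arange(1, 79_854)})
sample = systematic_sample_of_size(toy, n=10_000, seed=7)
print(len(sample))
```

Because the step is rounded down (7 for these numbers), the last rows of the frame are never reached, so this is only an approximation of a textbook systematic sample.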


## Cluster Sampling

In [110]:
def cluster_sampling(df, number_of_clusters):

    # Equal-size clusters are only possible when the number of rows
    # is divisible by the number of clusters
    if len(df) % number_of_clusters != 0:
        print("The population cannot be divided into clusters of equal size!")
        return None

    # Divide the units into clusters of equal size (ids 1..number_of_clusters)
    cluster_size = len(df) // number_of_clusters
    df = df.copy()  # work on a copy so the original frame is not modified
    df['cluster_id'] = np.repeat(np.arange(1, number_of_clusters + 1), cluster_size)

    # Keep the rows from the clusters that meet the criteria:
    # for this formula, the cluster id must be an even number
    cluster_sample = df[df['cluster_id'] % 2 == 0]
    return cluster_sample
        
# Obtain a cluster sample and save it in a new variable
cluster_sample = cluster_sampling(df,6)

# Save the sample mean in a separate variable
#cluster_mean = round(cluster_sample['measure'].mean(),3)

# View sampled data frame
cluster_sample

The population cannot be divided into clusters of equal size!


## Questions: 
- Why was it not possible to perform cluster sampling in this case?
- Which dataset would be better suited to this method?
- Does the number of rows have to be divisible by the number of clusters?
- Where in this code do I set the desired sample size and the desired number of clusters? How can I choose a 10k sample?
- What do I have to change in this code to get a result?
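
My guess is that the function fails because 79,853 rows cannot be split into 6 clusters of equal size. Below is a sketch of one possible workaround, where clusters may differ by one row and whole clusters are picked at random. The function name and its parameters are hypothetical, and cutting the rows into consecutive blocks is only an assumption; real clusters would normally come from something meaningful in the data (for example a region or a branch).

```python
import numpy as np
import pandas as pd

def cluster_sample_sketch(df, number_of_clusters, clusters_to_pick, seed=None):
    """Hypothetical helper: split rows into near-equal blocks and keep some whole blocks."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Consecutive blocks of rows; block sizes differ by at most one row
    out['cluster_id'] = np.arange(len(out)) * number_of_clusters // len(out) + 1
    # Randomly choose which clusters to keep, without repetition
    chosen = rng.choice(np.arange(1, number_of_clusters + 1),
                        size=clusters_to_pick, replace=False)
    return out[out['cluster_id'].isin(chosen)]

# Toy frame with the same number of rows as the insurance data
toy = pd.DataFrame({'id': np.arange(1, 79_854)})

# 8 clusters of about 9,982 rows each; keeping 1 cluster gives close to 10k rows
sample = cluster_sample_sketch(toy, number_of_clusters=8, clusters_to_pick=1, seed=3)
print(len(sample), sample['cluster_id'].unique())
```

With this approach the sample size is only controlled indirectly: it is roughly (rows / number of clusters) × clusters kept.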

## Stratified Random Sampling

In [111]:
# Import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

In [112]:
# the functions:
def stratified_sample(df, strata, size=None, seed=None, keep_index= True):
    '''
    It samples data from a pandas dataframe using strata. These functions use
    proportionate stratification:
    n1 = (N1/N) * n
    where:
        - n1 is the sample size of stratum 1
        - N1 is the population size of stratum 1
        - N is the total population size
        - n is the sampling size
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    :seed: sampling seed
    :keep_index: if True, it keeps a column with the original population index indicator
    
    Returns
    -------
    A sampled pandas dataframe based in a set of strata.
    Examples
    --------
    >> df.head()
    	id  sex age city 
    0	123 M   20  XYZ
    1	456 M   25  XYZ
    2	789 M   21  YZX
    3	987 F   40  ZXY
    4	654 M   45  ZXY
    ...
    # This returns a sample stratified by sex and city containing 30% of the size of
    # the original data
    >> stratified = stratified_sample(df=df, strata=['sex', 'city'], size=0.3)
    Requirements
    ------------
    - pandas
    - numpy
    '''
    population = len(df)
    size = __smpl_size(population, size)
    tmp = df[strata].copy()  # copy so that adding 'size' does not touch df
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)

    # controlling variable to create the dataframe or append to it
    first = True 
    for i in range(len(tmp_grpd)):
        # query generator for each iteration
        qry=''
        for s in range(len(strata)):
            stratum = strata[s]
            value = tmp_grpd.iloc[i][stratum]
            n = tmp_grpd.iloc[i]['samp_size']

            if type(value) == str:
                value = "'" + str(value) + "'"
            
            if s != len(strata)-1:
                qry = qry + stratum + ' == ' + str(value) +' & '
            else:
                qry = qry + stratum + ' == ' + str(value)
        
        # final dataframe
        tmp_df = df.query(qry).sample(n=n, random_state=seed).reset_index(drop=(not keep_index))
        if first:
            stratified_df = tmp_df
            first = False
        else:
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            stratified_df = pd.concat([stratified_df, tmp_df], ignore_index=True)
    
    return stratified_df



def stratified_sample_report(df, strata, size=None):
    '''
    Generates a dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    Parameters
    ----------
    :df: pandas dataframe from which data will be sampled.
    :strata: list containing columns that will be used in the stratified sampling.
    :size: sampling size. If not informed, a sampling size will be calculated
        using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Returns
    -------
    A dataframe reporting the counts in each stratum and the counts
    for the final sampled dataframe.
    '''
    population = len(df)
    size = __smpl_size(population, size)
    tmp = df[strata].copy()  # copy so that adding 'size' does not touch df
    tmp['size'] = 1
    tmp_grpd = tmp.groupby(strata).count().reset_index()
    tmp_grpd['samp_size'] = round(size/population * tmp_grpd['size']).astype(int)
    return tmp_grpd


def __smpl_size(population, size):
    '''
    A function to compute the sample size. If not informed, a sampling 
    size will be calculated using Cochran adjusted sampling formula:
        cochran_n = (Z**2 * p * q) /e**2
        where:
            - Z is the z-value. In this case we use 1.96 representing 95%
            - p is the estimated proportion of the population which has an
                attribute. In this case we use 0.5
            - q is 1-p
            - e is the margin of error
        This formula is adjusted as follows:
        adjusted_cochran = cochran_n / (1 + ((cochran_n - 1) / N))
        where:
            - cochran_n = result of the previous formula
            - N is the population size
    Parameters
    ----------
        :population: population size
        :size: sample size (default = None)
    Returns
    -------
    Calculated sample size to be used in the functions:
        - stratified_sample
        - stratified_sample_report
    '''
    if size is None:
        cochran_n = round(((1.96)**2 * 0.5 * 0.5)/ 0.02**2)
        n = round(cochran_n/(1+((cochran_n -1) /population)))
    elif size >= 0 and size < 1:
        n = round(population * size)
    elif size < 0:
        raise ValueError('Parameter "size" must be an integer or a proportion between 0 and 0.99.')
    elif size >= 1:
        n = size
    return n
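
To make the default branch of `__smpl_size` more concrete, here is the same arithmetic worked through with the constants used in the function (z = 1.96, p = 0.5, e = 0.02) and the 79,853 rows of this dataset. This is only my reading of the formula in the docstring.

```python
# Constants taken from the function above
z, p, e = 1.96, 0.5, 0.02
population = 79_853  # number of rows in this dataset

# Cochran's formula for a very large population
cochran_n = round((z**2 * p * (1 - p)) / e**2)

# Adjustment for a finite population of known size
adjusted_n = round(cochran_n / (1 + (cochran_n - 1) / population))

print(cochran_n, adjusted_n)  # 2401 2331
```

So if `size` is not passed, the function would sample about 2,331 rows from this dataset.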

## Questions: 
- This stratified sampling code is from the page https://www.kaggle.com/flaviobossolan/stratified-sampling-python.
- As you can see, the code has 3 functions that can give you more information than just a sample of the data. (Check the link above to learn more.)
- However, in this case we only want to create the sample, using the next cell below.
- What are 'strata' and how do I choose them? Can I use all the columns of the data as strata?
- How can I make a new strata column (if that is possible)?
- What is the seed and how do I choose the best seed?
- In the function def stratified_sample(df, strata, size=None, seed=None, keep_index= True), there is a line tmp['size'] = 1. What does the 1 mean?
- Do you have Python code for stratified sampling that is simpler than this?
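
On the last question, a shorter version seems possible with `groupby(...).sample(...)`, which pandas provides from version 1.1. The sketch below uses a made-up frame with one stratum column; the column values and the 10% fraction are assumptions for the illustration. As far as I understand, the seed (`random_state`) only makes the draw repeatable, so there is no "best" value.

```python
import pandas as pd

# Hypothetical population: 80 customers with default = 1 and 20 with default = 0
toy = pd.DataFrame({
    'default': [1] * 80 + [0] * 20,
    'value': range(100),
})

# Draw 10% from each stratum, so the sample keeps the 80/20 proportions
stratified = toy.groupby('default').sample(frac=0.1, random_state=123)

print(stratified['default'].value_counts())
```

A list of columns, such as `['default', 'agerange', 'risk_range']`, can be passed to `groupby` in the same way.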

In [113]:
# creating a sample
stratified_sample_df = stratified_sample(df, ['default', 'agerange', 'risk_range'], size=10000, seed=123, keep_index= True)
stratified_sample_df.head()

Unnamed: 0,index,id,perc_premium_paid_by_cash_credit,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,Marital Status,Veh_Owned,No_of_dep,Accomodation,...,sourcing_channel,residence_area_type,premium,default,client_age,agerange,risk_range,Income_Q4,Income_Q10,measure
0,49405,49406,0.187,3,0,0,0,2,1,1,...,B,Urban,7500,0,36.030137,21-40,94-95.99,"(24029.999, 108010.0]","(71200.0, 96110.0]",9.542
1,73682,73683,0.521,0,0,0,1,1,3,0,...,A,Rural,5700,0,38.032877,21-40,94-95.99,"(24029.999, 108010.0]","(24029.999, 71200.0]",10.545
2,3067,3068,0.563,0,0,0,0,1,4,0,...,C,Urban,9600,0,35.027397,21-40,96-97.99,"(166560.0, 252090.0]","(166560.0, 195120.0]",10.698
3,59566,59567,0.32,2,1,0,1,3,4,0,...,A,Rural,5700,0,34.032877,21-40,96-97.99,"(24029.999, 108010.0]","(71200.0, 96110.0]",10.469
4,47589,47590,0.95,0,0,0,0,3,2,1,...,A,Rural,1200,0,28.021918,21-40,96-97.99,"(24029.999, 108010.0]","(24029.999, 71200.0]",10.712


In [114]:
# Save the sample mean in a separate variable
stratified_sample_mean = round(stratified_sample_df['measure'].mean(),3)
# View sampled data frame
stratified_sample_df.head()

Unnamed: 0,index,id,perc_premium_paid_by_cash_credit,Count_3-6_months_late,Count_6-12_months_late,Count_more_than_12_months_late,Marital Status,Veh_Owned,No_of_dep,Accomodation,...,sourcing_channel,residence_area_type,premium,default,client_age,agerange,risk_range,Income_Q4,Income_Q10,measure
0,49405,49406,0.187,3,0,0,0,2,1,1,...,B,Urban,7500,0,36.030137,21-40,94-95.99,"(24029.999, 108010.0]","(71200.0, 96110.0]",9.542
1,73682,73683,0.521,0,0,0,1,1,3,0,...,A,Rural,5700,0,38.032877,21-40,94-95.99,"(24029.999, 108010.0]","(24029.999, 71200.0]",10.545
2,3067,3068,0.563,0,0,0,0,1,4,0,...,C,Urban,9600,0,35.027397,21-40,96-97.99,"(166560.0, 252090.0]","(166560.0, 195120.0]",10.698
3,59566,59567,0.32,2,1,0,1,3,4,0,...,A,Rural,5700,0,34.032877,21-40,96-97.99,"(24029.999, 108010.0]","(71200.0, 96110.0]",10.469
4,47589,47590,0.95,0,0,0,0,3,2,1,...,A,Rural,1200,0,28.021918,21-40,96-97.99,"(24029.999, 108010.0]","(24029.999, 71200.0]",10.712


In [115]:
print("Measure Mean Comparison per Sampling Method")

# Create a dictionary with the mean outcomes for each sampling method and the real mean
outcomes = {'sample_mean':[simple_random_mean,systematic_mean,stratified_sample_mean],
           'real_mean': real_mean}

# Transform dictionary into a data frame
outcomes = pd.DataFrame(outcomes, index=['Simple Random Sampling','Systematic Sampling','stratified_sample_mean'])

# Add a value corresponding to the absolute error
outcomes['abs_error'] = abs(outcomes['real_mean'] - outcomes['sample_mean'])

# Sort data frame by absolute error
outcomes.sort_values(by='abs_error')


Measure Mean Comparison per Sampling Method


Unnamed: 0,sample_mean,real_mean,abs_error
Simple Random Sampling,10.002,10.002,0.0
Systematic Sampling,10.003,10.002,0.001
stratified_sample_mean,10.004,10.002,0.002


- According to the Measure Mean Comparison per Sampling Method table, the measure mean of the sample obtained through simple random sampling was the closest to the real mean, with an absolute error of 0.000 units.
- (Remember that sampling is random: every run draws new samples, so these numbers and the ranking of the methods can change.)
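
To get a feeling for how much of the difference between methods is just chance, one option is to repeat the simple random sample many times and look at the spread of the sample means. The sketch below uses a simulated measure column with the same parameters as before; the seed and the 200 repetitions are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated population, built like the 'measure' column in this notebook
population = rng.normal(loc=10, scale=0.5, size=79_853)

# Draw 200 simple random samples of 10,000 rows and store each sample mean
sample_means = np.array([
    rng.choice(population, size=10_000, replace=False).mean()
    for _ in range(200)
])

# Standard deviation of the sample means: the typical run-to-run variation
spread = sample_means.std()
print(round(float(spread), 4))
```

If the spread is about the same size as the absolute errors in the table, the ranking of the methods from a single run probably should not be read as meaningful.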

## Replace Structure

In [95]:
# using the sample dataset
simple_random_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 16 to 79847
Data columns (total 20 columns):
 #   Column                            Non-Null Count  Dtype   
---  ------                            --------------  -----   
 0   id                                10000 non-null  int64   
 1   perc_premium_paid_by_cash_credit  10000 non-null  float64 
 2   Count_3-6_months_late             10000 non-null  int64   
 3   Count_6-12_months_late            10000 non-null  int64   
 4   Count_more_than_12_months_late    10000 non-null  int64   
 5   Marital Status                    10000 non-null  int64   
 6   Veh_Owned                         10000 non-null  int64   
 7   No_of_dep                         10000 non-null  int64   
 8   Accomodation                      10000 non-null  int64   
 9   no_of_premiums_paid               10000 non-null  int64   
 10  sourcing_channel                  10000 non-null  category
 11  residence_area_type               10000 non-null  cat

In [96]:
# List the columns to one-hot encode
oneHotCols=["residence_area_type","Count_3-6_months_late","Count_6-12_months_late","Count_more_than_12_months_late","agerange","risk_range",'sourcing_channel',"Income_Q4","Income_Q10"]

In [97]:
# Create dummy (one-hot) columns in the sample dataset
simple_random_sample=pd.get_dummies(simple_random_sample, columns=oneHotCols)
simple_random_sample.head(5)

Unnamed: 0,id,perc_premium_paid_by_cash_credit,Marital Status,Veh_Owned,No_of_dep,Accomodation,no_of_premiums_paid,premium,default,client_age,...,"Income_Q10_(24029.999, 71200.0]","Income_Q10_(71200.0, 96110.0]","Income_Q10_(96110.0, 120080.0]","Income_Q10_(120080.0, 142876.0]","Income_Q10_(142876.0, 166560.0]","Income_Q10_(166560.0, 195120.0]","Income_Q10_(195120.0, 231150.0]","Income_Q10_(231150.0, 279030.0]","Income_Q10_(279030.0, 357414.0]","Income_Q10_(357414.0, 90262600.0]"
16,17,0.03,0,3,3,1,14,18000,1,60.041096,...,0,0,0,0,1,0,0,0,0,0
20,21,1.0,0,3,4,1,4,11700,1,40.035616,...,0,0,0,0,0,0,1,0,0,0
31,32,0.031,1,2,4,0,13,9600,1,49.038356,...,0,0,0,0,0,0,0,1,0,0
35,36,0.942,1,2,2,1,5,11700,1,45.038356,...,0,0,0,0,0,0,1,0,0,0
36,37,0.07,1,2,3,0,23,11700,1,65.038356,...,0,0,0,0,0,1,0,0,0,0
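
For readers new to `pd.get_dummies`, here is a tiny sketch on a made-up frame. Each category of the listed column becomes its own 0/1 column named `<column>_<category>`, and the other columns are kept as they are. I pass `dtype=int` because, as far as I know, newer pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical frame with one categorical column
toy = pd.DataFrame({
    'agerange': ['21-40', '41-60', '21-40'],
    'premium': [5400, 11700, 3300],
})

# One new 0/1 column per category; 'premium' is left untouched
encoded = pd.get_dummies(toy, columns=['agerange'], dtype=int)
print(encoded)
```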


### > NOW THE SAMPLE DATA IS READY TO SPLIT