# Predicting T-shirt size using the ANSUR II dataset
Here we try to predict a person's t-shirt size from their weight and height. We use the ANSUR II dataset, which records many physical measurements for a large number of people.

We first map each person in the dataset to a t-shirt size. It is hard to find a concise t-shirt size chart, so we create our own initial chart, based on these assumptions:

We will only look at two measurements, Shoulder Width and Chest Circumference.

Our first problem is that Shoulder Width is not one of the measurements taken in the dataset. However, the dataset does have Biacromial Breadth, the distance between the two acromion processes, and we will assume it is the same as Shoulder Width.

We will then have these initial rules:

| Size | Percentile |
|------|------------|
| XS   | 0-5        |
| S    | 5-25       |
| M    | 25-50      |
| L    | 50-75      |
| XL   | 75-90      |
| XXL  | 90-97      |
| XXXL | 97-100     |
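As a minimal sketch of how the percentile rules in the table could be applied to a single measurement, the snippet below ranks each value and looks up its band. The helper name `size_from_percentile` and the sample values are made up for illustration, and treating each band's upper bound as inclusive is an assumption, since the table does not say which side of a boundary belongs to which size.

```python
import numpy as np
import pandas as pd

# Size labels and the upper percentile bound of each band, taken from the table above.
SIZES = ['XS', 'S', 'M', 'L', 'XL', 'XXL', 'XXXL']
UPPER_PERCENTILES = [5, 25, 50, 75, 90, 97, 100]

def size_from_percentile(values: pd.Series) -> pd.Series:
    """Label each value with the size band its percentile rank falls into.

    Hypothetical helper: the upper bound of each band is treated as inclusive.
    """
    # rank(pct=True) gives each value's rank as a fraction in (0, 1]
    pct = values.rank(pct=True) * 100
    # searchsorted finds the first band whose upper bound is >= the rank
    idx = np.searchsorted(UPPER_PERCENTILES, pct.to_numpy(), side='left')
    return pd.Series([SIZES[i] for i in idx], index=values.index)

# Made-up chest circumferences in millimetres, for illustration only
chest = pd.Series([780, 850, 900, 940, 980, 1010, 1050, 1150])
labels = size_from_percentile(chest)
print(labels.tolist())
```

With only eight values the smallest rank is already 12.5, so no value lands in XS; on the full dataset every band would normally be populated.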

## Inspect the data

In [1]:
import pandas as pd

In [2]:
female = pd.read_csv('./data/female.csv')

male = pd.read_csv('./data/male.csv')


In [3]:
print(f'For women we have (rows, columns) {female.shape}')

print(f'For men we have (rows, columns) {male.shape}')

For women we have (rows, columns) (1986, 108)
For men we have (rows, columns) (4082, 108)


## Checking the percentiles

Let us determine how many rows fall into each percentile range.

In [4]:
def compute_percentile_ranges(column):
    #Define percentile ranges
    ranges = [(0, 5), (5, 25), (25, 50), (50, 75), (75, 90), (90, 97), (97, 100)]

    percentiles = {(low, high): (column.quantile(low/100), column.quantile(high/100)) for low, high in ranges}

    counts = {}

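    for r, (low, high) in percentiles.items():
        # Half-open interval [low, high): a value equal to the upper bound is not counted,
        # so the column maximum (the 100th percentile) falls outside the last range
        counts[r] = ((column >= low) & (column < high)).sum()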
    
    return counts


print(compute_percentile_ranges(female['chestcircumference']))
print(compute_percentile_ranges(female['biacromialbreadth']))

print(compute_percentile_ranges(male['chestcircumference']))
print(compute_percentile_ranges(male['biacromialbreadth']))





{(0, 5): np.int64(100), (5, 25): np.int64(396), (25, 50): np.int64(492), (50, 75): np.int64(499), (75, 90): np.int64(299), (90, 97): np.int64(140), (97, 100): np.int64(59)}
{(0, 5): np.int64(93), (5, 25): np.int64(377), (25, 50): np.int64(477), (50, 75): np.int64(541), (75, 90): np.int64(297), (90, 97): np.int64(139), (97, 100): np.int64(61)}
{(0, 5): np.int64(199), (5, 25): np.int64(810), (25, 50): np.int64(1025), (50, 75): np.int64(1012), (75, 90): np.int64(616), (90, 97): np.int64(295), (97, 100): np.int64(124)}
{(0, 5): np.int64(191), (5, 25): np.int64(787), (25, 50): np.int64(989), (50, 75): np.int64(1079), (75, 90): np.int64(610), (90, 97): np.int64(303), (97, 100): np.int64(122)}
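In the output above, each set of counts sums to 1985 for women and 4081 for men, one fewer than the number of rows. This is consistent with the strict `<` comparison excluding the column maximum from the last range. A minimal sketch of one possible adjustment, closing only the last range on the right, is shown below; the function name `count_in_percentile_bands` and the toy data are assumptions for illustration.

```python
import pandas as pd

def count_in_percentile_bands(column, bands):
    """Count values per percentile band; only the last band includes its upper bound."""
    counts = {}
    last_band = bands[-1]
    for low, high in bands:
        lo_val = column.quantile(low / 100)
        hi_val = column.quantile(high / 100)
        if (low, high) == last_band:
            # Closed interval so the maximum is counted
            mask = (column >= lo_val) & (column <= hi_val)
        else:
            # Half-open interval so a boundary value is counted once
            mask = (column >= lo_val) & (column < hi_val)
        counts[(low, high)] = int(mask.sum())
    return counts

bands = [(0, 5), (5, 25), (25, 50), (50, 75), (75, 90), (90, 97), (97, 100)]
values = pd.Series(range(1, 101))  # toy data: the integers 1..100
counts = count_in_percentile_bands(values, bands)
print(counts)
print(sum(counts.values()), len(values))
```

Because neighbouring bands share the same quantile value as a boundary, every row is counted exactly once and the total equals the number of rows.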


## Generate the t-shirt size chart

In [5]:
def comput_size_percentile_measurements(data, chest_column, shoulder_column):
    sizes = ['XS', 'S', 'M', 'L', 'XL', '2XL', '3XL']
    ranges = [0, 5, 25 , 50, 75, 90, 97]

    # Compute the values for each percentile for chest and shoulder
    chest_percentiles = {p: data[chest_column].quantile(p/100) for p in ranges}
    shoulder_percentiles = {p: data[shoulder_column].quantile(p/100) for p in ranges}

    # Map the t-shirt sizes to the corresponding chest and shoulder measurements
    size_mappings = {}
    for i, size in enumerate(sizes):
        size_mappings[size] = {
            'Chest': int(chest_percentiles[ranges[i]]),
            'Shoulder': int(shoulder_percentiles[ranges[i]])
        }
    
    return size_mappings



print(comput_size_percentile_measurements(female, 'chestcircumference', 'biacromialbreadth'))
print(comput_size_percentile_measurements(male, 'chestcircumference', 'biacromialbreadth'))

{'XS': {'Chest': 695, 'Shoulder': 283}, 'S': {'Chest': 824, 'Shoulder': 335}, 'M': {'Chest': 889, 'Shoulder': 353}, 'L': {'Chest': 940, 'Shoulder': 365}, 'XL': {'Chest': 999, 'Shoulder': 378}, '2XL': {'Chest': 1057, 'Shoulder': 389}, '3XL': {'Chest': 1117, 'Shoulder': 400}}
{'XS': {'Chest': 774, 'Shoulder': 337}, 'S': {'Chest': 922, 'Shoulder': 384}, 'M': {'Chest': 996, 'Shoulder': 403}, 'L': {'Chest': 1056, 'Shoulder': 415}, 'XL': {'Chest': 1117, 'Shoulder': 428}, '2XL': {'Chest': 1172, 'Shoulder': 441}, '3XL': {'Chest': 1233, 'Shoulder': 452}}


In [6]:

female_sizes = {
    'XS': {'Chest': 695, 'Shoulder': 283}, # Chest: 695, 829
    'S': {'Chest': 824, 'Shoulder': 335}, # Chest: 819, 894
    'M': {'Chest': 889, 'Shoulder': 353}, 
    'L': {'Chest': 940, 'Shoulder': 365}, 
    'XL': {'Chest': 999, 'Shoulder': 378}, 
    '2XL': {'Chest': 1057, 'Shoulder': 389}, 
    '3XL': {'Chest': 1117, 'Shoulder': 400} 
    }

male_sizes = {
    'XS': {'Chest': 774, 'Shoulder': 337}, 
    'S': {'Chest': 922, 'Shoulder': 384}, 
    'M': {'Chest': 996, 'Shoulder': 403}, 
    'L': {'Chest': 1056, 'Shoulder': 415}, 
    'XL': {'Chest': 1117, 'Shoulder': 428}, 
    '2XL': {'Chest': 1172, 'Shoulder': 441}, 
    '3XL': {'Chest': 1233, 'Shoulder': 452}
    }

In [7]:
def get_size(data, size_chart):
    matches = {size: 0 for size in size_chart.keys()}
    ties = 0

    for _, row in data.iterrows():
        possible_sizes = []

        for size, measurements in size_chart.items():
            if (row['biacromialbreadth'] <= measurements['Shoulder'] and
                row['chestcircumference'] <= measurements['Chest']):
                possible_sizes.append(size)
        
        if len(possible_sizes) == 1:
            matches[possible_sizes[0]] += 1
        elif len(possible_sizes) > 1:
            ties += 1
    
    return matches, ties

In [8]:
female_matches, female_ties = get_size(female, female_sizes)
male_matches, male_ties = get_size(male, male_sizes)

print('Female matches: ', female_matches)
print('Female ties: ', female_ties)
print('Male matches: ', male_matches)
print('Male ties: ', male_ties)


Female matches:  {'XS': 0, 'S': 0, 'M': 0, 'L': 0, 'XL': 0, '2XL': 0, '3XL': 236}
Female ties:  1642
Male matches:  {'XS': 0, 'S': 0, 'M': 0, 'L': 0, 'XL': 0, '2XL': 0, '3XL': 434}
Male ties:  3437


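This is not good. `get_size` treats each chart value as an upper limit, and the limits grow with size, so anyone who fits a small size also fits every larger one. Almost everyone therefore counts as a tie, and the only unique matches are 3XL. Let us instead give each size a range, with neighbouring ranges overlapping slightly.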
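For comparison, a different way to avoid ties with single limits is to pick the smallest size whose limits are not exceeded. This is only a sketch of that alternative, not the approach the notebook takes next; `smallest_fitting_size` and the toy chart are hypothetical, and the chart values are read as upper limits here.

```python
import pandas as pd

# Toy chart in the same shape as the single-limit charts above; the numbers are made up.
toy_chart = {
    'S': {'Chest': 850, 'Shoulder': 340},
    'M': {'Chest': 950, 'Shoulder': 370},
    'L': {'Chest': 1050, 'Shoulder': 400},
}

def smallest_fitting_size(chest, shoulder, size_chart):
    """Return the first size whose limits are not exceeded, or None if nothing fits."""
    # Relies on the dict being ordered from the smallest size to the largest
    for size, limits in size_chart.items():
        if chest <= limits['Chest'] and shoulder <= limits['Shoulder']:
            return size
    return None

people = pd.DataFrame({
    'chestcircumference': [800, 900, 1000, 1100],
    'biacromialbreadth': [330, 380, 390, 395],
})
assigned = [
    smallest_fitting_size(c, s, toy_chart)
    for c, s in zip(people['chestcircumference'], people['biacromialbreadth'])
]
print(assigned)
```

The second person has a medium chest but broad shoulders, so the larger of the two measurements decides the size. The last person exceeds every limit and stays unassigned.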

In [9]:
def create_overlapping_size_chart(original_chart):
    overlapping_chart = {}

    sizes = list(original_chart.keys())

    for i, size in enumerate(sizes):
        overlapping_chart[size] = {}
        if i == 0:
            overlapping_chart[size]['Chest'] = [original_chart[size]['Chest'], original_chart[sizes[i+1]]['Chest']+5]
            overlapping_chart[size]['Shoulder'] = [original_chart[size]['Shoulder'], original_chart[sizes[i+1]]['Shoulder']+5]

        elif i ==len(sizes)-1:
            overlapping_chart[size]['Chest'] = [original_chart[size]['Chest']-5, original_chart[size]['Chest']+1000]
            overlapping_chart[size]['Shoulder'] = [original_chart[size]['Shoulder']-5, original_chart[size]['Shoulder']+1000]

        else:
            overlapping_chart[size]['Chest'] = [original_chart[size]['Chest']-5, original_chart[sizes[i+1]]['Chest']+5]
            overlapping_chart[size]['Shoulder'] = [original_chart[size]['Shoulder']-5, original_chart[sizes[i+1]]['Shoulder']+5]
        
    return overlapping_chart





In [10]:
new_female_sizes = create_overlapping_size_chart(female_sizes)
new_male_sizes = create_overlapping_size_chart(male_sizes)

for k, v in new_female_sizes.items():
    print(f"'{k}' : {v}, ")

print()

for k, v in new_male_sizes.items():
    print(f"'{k}' : {v}, ")




'XS' : {'Chest': [695, 829], 'Shoulder': [283, 340]}, 
'S' : {'Chest': [819, 894], 'Shoulder': [330, 358]}, 
'M' : {'Chest': [884, 945], 'Shoulder': [348, 370]}, 
'L' : {'Chest': [935, 1004], 'Shoulder': [360, 383]}, 
'XL' : {'Chest': [994, 1062], 'Shoulder': [373, 394]}, 
'2XL' : {'Chest': [1052, 1122], 'Shoulder': [384, 405]}, 
'3XL' : {'Chest': [1112, 2117], 'Shoulder': [395, 1400]}, 

'XS' : {'Chest': [774, 927], 'Shoulder': [337, 389]}, 
'S' : {'Chest': [917, 1001], 'Shoulder': [379, 408]}, 
'M' : {'Chest': [991, 1061], 'Shoulder': [398, 420]}, 
'L' : {'Chest': [1051, 1122], 'Shoulder': [410, 433]}, 
'XL' : {'Chest': [1112, 1177], 'Shoulder': [423, 446]}, 
'2XL' : {'Chest': [1167, 1238], 'Shoulder': [436, 457]}, 
'3XL' : {'Chest': [1228, 2233], 'Shoulder': [447, 1452]}, 


In [11]:
female_sizes = {
'XS' : {'Chest': [695, 829], 'Shoulder': [283, 340]}, 
'S' : {'Chest': [819, 894], 'Shoulder': [330, 358]}, 
'M' : {'Chest': [884, 945], 'Shoulder': [348, 370]}, 
'L' : {'Chest': [935, 1004], 'Shoulder': [360, 383]}, 
'XL' : {'Chest': [994, 1062], 'Shoulder': [373, 394]}, 
'2XL' : {'Chest': [1052, 1122], 'Shoulder': [384, 405]}, 
'3XL' : {'Chest': [1112, 2117], 'Shoulder': [395, 1400]}
}

male_sizes = {
'XS' : {'Chest': [774, 927], 'Shoulder': [337, 389]}, 
'S' : {'Chest': [917, 1001], 'Shoulder': [379, 408]}, 
'M' : {'Chest': [991, 1061], 'Shoulder': [398, 420]}, 
'L' : {'Chest': [1051, 1122], 'Shoulder': [410, 433]}, 
'XL' : {'Chest': [1112, 1177], 'Shoulder': [423, 446]}, 
'2XL' : {'Chest': [1167, 1238], 'Shoulder': [436, 457]}, 
'3XL' : {'Chest': [1228, 2233], 'Shoulder': [447, 1452]}
}

1.22 lab: Write a matching function for the new overlapping size chart and run a preliminary analysis

In [12]:
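# Count matches and ties for the new overlapping size chart.
# In the new chart, chest and shoulder width for each size are ranges [lower, upper].

def get_size_with_overlap(data, size_chart):
    """
    Args:
        data: DataFrame with 'chestcircumference' and 'biacromialbreadth' columns
        size_chart: new size chart, in the format of female_sizes or male_sizes; each measurement is a list [min, max]
    Returns:
        matches: dict with the number of people uniquely matched to each size
        ties: int, the number of people who fit more than one size
    """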
    matches = {size: 0 for size in size_chart.keys()}
    ties = 0

    for _, row in data.iterrows():
        possible_sizes = []
        chest_val = row['chestcircumference']
        shoulder_val = row['biacromialbreadth']
        
        # Check which size ranges the data point falls into
        for size, measurements in size_chart.items():
            chest_range = measurements['Chest']
            shoulder_range = measurements['Shoulder']
            # The point fits if both chest and shoulder lie within this size's ranges
            if (chest_range[0] <= chest_val <= chest_range[1]) and \
               (shoulder_range[0] <= shoulder_val <= shoulder_range[1]):
                possible_sizes.append(size)
        
        # Tally: unique match or tie
        if len(possible_sizes) == 1:
            matches[possible_sizes[0]] += 1
        elif len(possible_sizes) > 1:
            ties += 1
    
    return matches, ties



In [13]:
# Run the new function on the female and male datasets

new_female_matches, new_female_ties = get_size_with_overlap(female, female_sizes)
new_male_matches, new_male_ties = get_size_with_overlap(male, male_sizes)

print('Female matches with new overlapping chart: ', new_female_matches)
print('Female ties with new overlapping chart: ', new_female_ties)
print()
print('Male matches with new overlapping chart: ', new_male_matches)
print('Male ties with new overlapping chart: ', new_male_ties)

Female matches with new overlapping chart:  {'XS': 23, 'S': 180, 'M': 230, 'L': 248, 'XL': 108, '2XL': 30, '3XL': 11}
Female ties with new overlapping chart:  67

Male matches with new overlapping chart:  {'XS': 63, 'S': 419, 'M': 542, 'L': 532, 'XL': 287, '2XL': 88, '3XL': 47}
Male ties with new overlapping chart:  166


In [14]:
# Brief analysis
print(f"female dataset: total number {len(female)}, matches are {sum(new_female_matches.values())} , ties are {new_female_ties} ")
print(f"male dataset: total number {len(male)}, matches are {sum(new_male_matches.values())} ,ties are  {new_male_ties} ")




female dataset: total number 1986, matches are 830 , ties are 67 
male dataset: total number 4082, matches are 1978 ,ties are  166 


1.23

In [None]:
# Task 1: calculate and print the key metrics for comparison

# 1: Calculate and compare the match rate and the number of unclassified people for the old and new methods.
def calculate_metrics(data, matches_dict, ties_count, method_name):
    """
    Calculate and print the core metrics of a classification method.
    """
    total = len(data)
    unique_matches = sum(matches_dict.values())
    unmatched = total - unique_matches - ties_count  # people who are neither uniquely matched nor tied
    
    match_rate = (unique_matches / total) * 100 if total > 0 else 0
    
    print(f"【{method_name}】")
    print(f" Number of unique matches: {unique_matches} ({match_rate:.1f}%)")
    print(f" ties number: {ties_count}")  # number of ties
    print(f" unmatched number: {unmatched}")
    print()
    
    return {
        'Method': method_name,
        'Unique Matches': unique_matches,
        'Ties': ties_count,
        'Unmatched': unmatched,
        'Match Rate %': round(match_rate, 1)
    }

print("Comparison of classification results for Female datasets")

# To include the old method, reuse its results from the earlier get_size call
# (female_matches, female_ties). Do not re-run get_size on female_sizes here:
# that dictionary now holds the overlapping [min, max] ranges, not the original single limits.
# Example call (if the variables still exist):
#old_female_metrics = calculate_metrics(female, female_matches, female_ties, "Original method (non-overlapping)")
new_female_metrics = calculate_metrics(female, new_female_matches, new_female_ties, "New method (overlapping range)")

print()
print("Comparison of classification results for Male datasets")
# old_male_metrics = calculate_metrics(male, male_matches, male_ties, "Original method (non-overlapping)")
new_male_metrics = calculate_metrics(male, new_male_matches, new_male_ties, "New method (overlapping range)")

Comparison of classification results for Female datasets
【New method (overlapping range)】
 Number of unique matches: 830 (41.8%)
 ties number: 67
 unmatched number: 1089


Comparison of classification results for Male datasets
【New method (overlapping range)】
 Number of unique matches: 1978 (48.5%)
 ties number: 166
 unmatched number: 1938



In [None]:
#Create a comparison summary table

#Using pandas
import pandas as pd

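# Assume the old metric dictionaries (old_female_metrics, old_male_metrics) were obtained with the function above.
# Build a comparison DataFrame here, replacing `old_female_metrics` and `old_male_metrics` with the actual variables.
# For example:
# metrics_list = [old_female_metrics, new_female_metrics, old_male_metrics, new_male_metrics]

# Because the old results may need to be recomputed, show a summary table for the new method only for now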
print("Summary of Results for the New Method (Overlapping Range)")

summary_data = [
    {'Dataset': 'Female', 'Matches': new_female_matches, 'Ties': new_female_ties},
    {'Dataset': 'Male', 'Matches': new_male_matches, 'Ties': new_male_ties}
]
summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))
print()


Summary of Results for the New Method (Overlapping Range)
Dataset                                                                   Matches  Ties
 Female {'XS': 23, 'S': 180, 'M': 230, 'L': 248, 'XL': 108, '2XL': 30, '3XL': 11}    67
   Male {'XS': 63, 'S': 419, 'M': 542, 'L': 532, 'XL': 287, '2XL': 88, '3XL': 47}   166
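The summary above prints each `Matches` dictionary inside a single cell, which is hard to read. A possible alternative, sketched here with the counts copied from the output above, is to spread the dictionary into one column per size. The variable names `female_counts`, `male_counts` and `wide_summary` are made up for this example.

```python
import pandas as pd

# Counts copied from the printed results for the overlapping chart
female_counts = {'XS': 23, 'S': 180, 'M': 230, 'L': 248, 'XL': 108, '2XL': 30, '3XL': 11}
male_counts = {'XS': 63, 'S': 419, 'M': 542, 'L': 532, 'XL': 287, '2XL': 88, '3XL': 47}

# Unpack each dictionary so every size becomes its own column
wide_summary = pd.DataFrame([
    {'Dataset': 'Female', **female_counts, 'Ties': 67},
    {'Dataset': 'Male', **male_counts, 'Ties': 166},
])
# Row-wise total of the size columns gives the number of unique matches
wide_summary['Unique matches'] = wide_summary[list(female_counts)].sum(axis=1)
print(wide_summary.to_string(index=False))
```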



In [17]:
#analyse 
# A brief textual analysis of the effectiveness of the new method.
def analyze_and_conclude(new_matches, new_ties, old_matches=None, old_ties=None):
    """
    Based on the results from the old and new methods, print out the analysis conclusions.
    """
    
    print("1. Impact of overlap range design:")
    print(" - The new method sets a range for chest and shoulder width for each size, rather than a single upper limit.")
    print(" - This makes it more likely that data points fall within a certain 'range', which should theoretically reduce the number of 'completely unclassifiable' cases. ")
    
    # The following analysis requires results from the old method; the conditional only runs when they are passed in.
    if old_matches is not None:
        total_new_matches = sum(new_matches.values())
        total_old_matches = sum(old_matches.values())
        change = total_new_matches - total_old_matches
        if change > 0:
            print(f"2. Change in the number of matches: The new method identifies {change} more unique matches than the original method.")
        elif change < 0:
            print(f"2. Change in the number of matches: The new method identifies {-change} fewer unique matches than the original method.")
        else:
            print("2. Change in the number of matches: The number of unique matches remains unchanged.")
    
    print("3. Key observation points:")
    print("   - Check the number of 'Ties'. More ties mean that more people's measurements fall in the overlapping area of multiple sizes.")
    print("   - Check the number of 'Unmatched' entries. Ideally, this number should be zero or very low.")
    print("\nConclusion: By introducing overlapping range intervals, the new size chart provides a more flexible classification method.")
    print("It can better handle individual data that is near the standard size boundaries.")

# Call the analysis function (pass in the old method's results for a full comparison)
# analyze_and_conclude(new_female_matches, new_female_ties, old_female_matches, old_female_ties)
# For now, analyze the new results only
analyze_and_conclude(new_female_matches, new_female_ties)



1. Impact of overlap range design:
 - The new method sets a range for chest and shoulder width for each size, rather than a single upper limit.
 - This makes it more likely that data points fall within a certain 'range', which should theoretically reduce the number of 'completely unclassifiable' cases. 
3. Key observation points:
   - Check the number of 'Ties'. More ties mean that more people's measurements fall in the overlapping area of multiple sizes.
   - Check the number of 'Unmatched' entries. Ideally, this number should be zero or very low.

Conclusion: By introducing overlapping range intervals, the new size chart provides a more flexible classification method.
It can better handle individual data that is near the standard size boundaries.
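The old-versus-new comparison that the analysis leaves optional can be filled in from the counts already printed in this notebook. The sketch below copies those counts by hand and derives the unmatched numbers and match rates; the method labels and the name `comparison` are chosen for this example only.

```python
import pandas as pd

# Totals, unique matches and ties copied from the outputs printed earlier
rows = [
    {'Dataset': 'Female', 'Method': 'single limits', 'Total': 1986, 'Unique': 236, 'Ties': 1642},
    {'Dataset': 'Female', 'Method': 'overlapping ranges', 'Total': 1986, 'Unique': 830, 'Ties': 67},
    {'Dataset': 'Male', 'Method': 'single limits', 'Total': 4082, 'Unique': 434, 'Ties': 3437},
    {'Dataset': 'Male', 'Method': 'overlapping ranges', 'Total': 4082, 'Unique': 1978, 'Ties': 166},
]
comparison = pd.DataFrame(rows)
# Unmatched = everyone who is neither uniquely matched nor tied
comparison['Unmatched'] = comparison['Total'] - comparison['Unique'] - comparison['Ties']
comparison['Match rate %'] = (comparison['Unique'] / comparison['Total'] * 100).round(1)
print(comparison.to_string(index=False))
```

On these numbers the overlapping chart raises the unique matches and cuts the ties sharply, but the unmatched count also rises (from 108 to 1089 for women and from 211 to 1938 for men), which is worth weighing against the "zero or very low" target stated above.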
