# Predicting T-shirt size using the ANSUR II dataset
We will try to predict a person's T-shirt size from their weight and height. We will use the ANSUR II dataset, which records many physical attributes for a large number of people.

First, we will map each person in the dataset to a T-shirt size. It is hard to find a concise size chart for T-shirts, so we will create our own initial chart based on these assumptions:

We will only look at two measurements: Shoulder Width and Chest Circumference.

Our first problem is that Shoulder Width is not one of the measurements in the dataset. However, the dataset does include Biacromial Breadth, which is the distance between the two acromion processes. We will assume that this is the same as Shoulder Width.

This gives us these initial rules:
 
| Size | Percentile |
|------|------------|
| XS   | 0-5        |
| S    | 5-25       |
| M    | 25-50      |
| L    | 50-75      |
| XL   | 75-90      |
| XXL  | 90-97      |
| XXXL | 97-100     |
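
The rules in the table can be sketched in code as follows. This is a minimal illustration on made-up numbers rather than the ANSUR II data. The helper name `size_from_percentile` is hypothetical, and treating each band as closed on its upper end is an assumption, since the table does not say which side a boundary value belongs to.

```python
import pandas as pd

# Percentile band edges and labels taken from the table above.
EDGES = [0, 5, 25, 50, 75, 90, 97, 100]
LABELS = ["XS", "S", "M", "L", "XL", "XXL", "XXXL"]

def size_from_percentile(values):
    """Label each value by the percentile band its rank falls into (hypothetical helper)."""
    series = pd.Series(values)
    # Percentile rank on a 0-100 scale; tied values share their average rank.
    pct = series.rank() * 100 / len(series)
    # Assumption: bands are closed on the right, e.g. (5, 25].
    return pd.cut(pct, bins=EDGES, labels=LABELS, include_lowest=True)

# Made-up chest measurements, purely for illustration.
toy = list(range(800, 900))
sizes = size_from_percentile(toy)
print(sizes.value_counts(sort=False))
```

With 100 distinct values, each band receives as many people as its width in percentile points, which makes the uneven spread of the bands easy to see.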
 

## Read and inspect the data

In [1]:
import pandas as pd

In [2]:
female = pd.read_csv('./data/female.csv')
male = pd.read_csv('./data/male.csv')

In [3]:
# .shape is a (rows, columns) tuple: each row is a person, each column a recorded attribute.
# We will end up working with only two of these columns.
print(f"For women we have (rows, columns) {female.shape}")
print(f"For men we have (rows, columns) {male.shape}")

For women we have (rows, columns) (1986, 108)
For men we have (rows, columns) (4082, 108)


## Checking the percentiles

Before we can build the size chart, we need to determine the percentiles of the data.

In [4]:
def compute_percentile_ranges(column):
    # Percentile bands from the size table, as a list of (low, high) tuples.
    ranges = [(0, 5), (5, 25), (25, 50), (50, 75), (75, 90), (90, 97), (97, 100)]

    # Map each band to the measurement values at its two percentiles.
    # Series.quantile expects a fraction between 0 and 1, hence the division by 100.
    # The keys are tuples because a list cannot be used as a dictionary key.
    percentiles = {(low, high): (column.quantile(low/100), column.quantile(high/100)) for low, high in ranges}

    # Count how many people fall inside each band.
    counts = {}
    for r, (low, high) in percentiles.items():  # items() yields (key, value) pairs
        # Half-open interval [low, high): a value equal to the upper bound counts
        # towards the next band, so the column maximum is not counted in any band.
        counts[r] = ((column >= low) & (column < high)).sum()

    return counts

print(compute_percentile_ranges(female['chestcircumference']))
print(compute_percentile_ranges(female['biacromialbreadth']))  # biacromial breadth, our stand-in for shoulder width

print(compute_percentile_ranges(male['chestcircumference']))
print(compute_percentile_ranges(male['biacromialbreadth']))

# The counts per band differ between the two measurements. For women, the (0, 5) band
# holds 100 people by chest but only 93 by shoulder (7 fewer), and the (5, 25) band
# holds 396 by chest but 377 by shoulder. Which size should people take when the two disagree?
# Example: for female chest circumference the (0, 5) band spans 695.0 to 824.25.

{(0, 5): np.int64(100), (5, 25): np.int64(396), (25, 50): np.int64(492), (50, 75): np.int64(499), (75, 90): np.int64(299), (90, 97): np.int64(140), (97, 100): np.int64(59)}
{(0, 5): np.int64(93), (5, 25): np.int64(377), (25, 50): np.int64(477), (50, 75): np.int64(541), (75, 90): np.int64(297), (90, 97): np.int64(139), (97, 100): np.int64(61)}
{(0, 5): np.int64(199), (5, 25): np.int64(810), (25, 50): np.int64(1025), (50, 75): np.int64(1012), (75, 90): np.int64(616), (90, 97): np.int64(295), (97, 100): np.int64(124)}
{(0, 5): np.int64(191), (5, 25): np.int64(787), (25, 50): np.int64(989), (50, 75): np.int64(1079), (75, 90): np.int64(610), (90, 97): np.int64(303), (97, 100): np.int64(122)}
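
Each set of counts above adds up to one less than the number of rows, because the half-open test `column < high` leaves out the maximum value. One way to keep that value is sketched below on made-up data. The helper name `count_per_band` is hypothetical, and the sketch relies on `np.histogram`, whose last bin is closed on both sides.

```python
import numpy as np
import pandas as pd

BANDS = [0, 5, 25, 50, 75, 90, 97, 100]

def count_per_band(column):
    """Count values per percentile band, keeping the maximum in the last band."""
    edges = column.quantile([b / 100 for b in BANDS]).to_numpy()
    # np.histogram uses half-open bins [low, high) except for the last bin,
    # which also includes its upper edge, so the maximum is counted.
    counts, _ = np.histogram(column.to_numpy(), bins=edges)
    return {(lo, hi): int(n) for lo, hi, n in zip(BANDS[:-1], BANDS[1:], counts)}

# Made-up measurements, purely for illustration.
toy = pd.Series(np.arange(1, 201))
counts = count_per_band(toy)
print(counts)
print(sum(counts.values()) == len(toy))
```

The final check confirms that every row lands in exactly one band.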


## Generate the t-shirt size chart

In [5]:
def compute_size_percentile_measurements(data, chest_column, shoulder_column):
    sizes = ['XS', 'S', 'M', 'L', 'XL', '2XL', '3XL']
    # Lower percentile bound of each size band, in the same order as `sizes`.
    ranges = [0, 5, 25, 50, 75, 90, 97]

    # Compute the chest and shoulder values at each percentile.
    chest_percentiles = {p: data[chest_column].quantile(p / 100) for p in ranges}
    shoulder_percentiles = {p: data[shoulder_column].quantile(p / 100) for p in ranges}
    #print(chest_percentiles)
    #print(shoulder_percentiles)

    # Map each T-shirt size to the chest and shoulder measurements at the start of its band.
    size_mappings = {}
    for i, size in enumerate(sizes):
        # One dictionary entry per size: XS, S, M, L, ...
        size_mappings[size] = {
            'Chest': int(chest_percentiles[ranges[i]]),
            'Shoulder': int(shoulder_percentiles[ranges[i]]),
        }
    return size_mappings

print(compute_size_percentile_measurements(female, 'chestcircumference', 'biacromialbreadth'))
print(compute_size_percentile_measurements(male, 'chestcircumference', 'biacromialbreadth'))


{'XS': {'Chest': 695, 'Shoulder': 283}, 'S': {'Chest': 824, 'Shoulder': 335}, 'M': {'Chest': 889, 'Shoulder': 353}, 'L': {'Chest': 940, 'Shoulder': 365}, 'XL': {'Chest': 999, 'Shoulder': 378}, '2XL': {'Chest': 1057, 'Shoulder': 389}, '3XL': {'Chest': 1117, 'Shoulder': 400}}
{'XS': {'Chest': 774, 'Shoulder': 337}, 'S': {'Chest': 922, 'Shoulder': 384}, 'M': {'Chest': 996, 'Shoulder': 403}, 'L': {'Chest': 1056, 'Shoulder': 415}, 'XL': {'Chest': 1117, 'Shoulder': 428}, '2XL': {'Chest': 1172, 'Shoulder': 441}, '3XL': {'Chest': 1233, 'Shoulder': 452}}


In [6]:
# Copy the output above and add line breaks (press Return) so that it reads as below.
female_size = {
    'XS': {'Chest': 695, 'Shoulder': 283},
    'S': {'Chest': 824, 'Shoulder': 335},
    'M': {'Chest': 889, 'Shoulder': 353},
    'L': {'Chest': 940, 'Shoulder': 365},
    'XL': {'Chest': 999, 'Shoulder': 378},
    '2XL': {'Chest': 1057, 'Shoulder': 389},
    '3XL': {'Chest': 1117, 'Shoulder': 400}
}

male_size = {
    'XS': {'Chest': 774, 'Shoulder': 337}, 
    'S': {'Chest': 922, 'Shoulder': 384}, 
    'M': {'Chest': 996, 'Shoulder': 403}, 
    'L': {'Chest': 1056, 'Shoulder': 415}, 
    'XL': {'Chest': 1117, 'Shoulder': 428}, 
    '2XL': {'Chest': 1172, 'Shoulder': 441}, 
    '3XL': {'Chest': 1233, 'Shoulder': 452}
}
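
As an alternative to copying the printed dictionaries by hand, the chart could be kept as a table. The following is only a sketch on made-up data: the helper name `build_size_chart` and the `toy` DataFrame are assumptions for illustration, and only the column names match the ANSUR II files.

```python
import pandas as pd

SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL"]
LOWER_PERCENTILES = [0, 5, 25, 50, 75, 90, 97]

def build_size_chart(data, chest_column, shoulder_column):
    """Return a DataFrame with one row per size (hypothetical helper)."""
    q = [p / 100 for p in LOWER_PERCENTILES]
    chart = pd.DataFrame({
        "Chest": data[chest_column].quantile(q).to_numpy(),
        "Shoulder": data[shoulder_column].quantile(q).to_numpy(),
    }, index=SIZES)
    # Truncate to whole numbers, as int() does in the function above.
    return chart.astype(int)

# Made-up measurements standing in for the real CSV files.
toy = pd.DataFrame({
    "chestcircumference": range(700, 1101),
    "biacromialbreadth": range(300, 701),
})
toy_chart = build_size_chart(toy, "chestcircumference", "biacromialbreadth")
print(toy_chart)

# The nested-dictionary form used in this notebook can be recovered when needed.
print(toy_chart.to_dict(orient="index")["M"])
```

Keeping the chart as a DataFrame avoids transcription mistakes, and `to_dict(orient="index")` gives back the same nested layout as `female_size` and `male_size`.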

1.21 Lab: Earlier in the project we noted that conflicts can arise when sizes are compared across different measurements, such as chest circumference and shoulder breadth. For instance, a person might have size S for chest but size M for shoulders. Your task is to get a clearer picture of how many individuals have matching sizes for both measurements and how many have different sizes (i.e., they fall into different sizes for shoulder breadth and chest circumference).

1. Use the size chart: Use a size chart that specifies the limits for shoulder breadth and chest circumference for each size.

2. Create a function: Write a function that iterates through each person's measurements and compares them with the size chart.

3. Count matches and conflicts: The function should count the number of individuals who have exactly one matching size and the number of individuals who have multiple possible sizes (conflicts).

4. Test your function with both female and male datasets, and use the appropriate size chart for each gender.
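
The steps of the task can be sketched in a compact, vectorised form with `pd.cut`. This is a minimal illustration, not the solution used in the cells that follow. The helper name `assign_sizes`, the three-size chart, and the four people are all made up, and each limit is treated as the lower bound of its size.

```python
import numpy as np
import pandas as pd

def assign_sizes(values, lower_bounds, labels):
    """Label each value with the size whose interval contains it (hypothetical helper)."""
    # Each size starts at its lower bound; the last size is open-ended.
    edges = list(lower_bounds) + [np.inf]
    return pd.cut(values, bins=edges, labels=labels, right=False)

labels = ["S", "M", "L"]
# Made-up limits and people, purely for illustration.
chest_bounds = [0, 900, 1000]
shoulder_bounds = [0, 380, 410]
people = pd.DataFrame({
    "chest": [850, 950, 1050, 920],
    "shoulder": [370, 390, 400, 420],
})

chest_size = assign_sizes(people["chest"], chest_bounds, labels)
shoulder_size = assign_sizes(people["shoulder"], shoulder_bounds, labels)

# Compare as plain strings so the two categorical columns can be checked directly.
match = chest_size.astype(str) == shoulder_size.astype(str)
print(f"matched: {match.sum()}, conflicted: {(~match).sum()}")
```

In this toy data the first two people get the same size from both measurements and the last two do not.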

In [7]:
# 1. Use the size chart: build a chart that specifies the limits for shoulder breadth
# and chest circumference for each size, i.e. a minimum and a maximum per size.
def create_size_ranges(size_chart):
    '''
    Convert the size chart above into intervals (minimum and maximum) so that a
    measurement can be assigned to a size.
    Each size gets a lower and an upper limit for chest and for shoulder.

    Note: the chart values are the lower percentile bounds of each band (XS starts
    at the minimum), so using them as upper limits, as done here, shifts every label
    by one band: nobody falls below the XS limit, and the last size absorbs the top
    two bands.
    '''
    size_ranges = {}
    sizes = list(size_chart.keys())

    # Determine the interval for each size from the chart values.
    for i, size in enumerate(sizes):
        size_range = {'Chest': {}, 'Shoulder': {}}

        # First size (XS): the minimum is 0.
        if i == 0:
            size_range['Chest']['min'] = 0
            size_range['Chest']['max'] = size_chart[size]['Chest']
            size_range['Shoulder']['min'] = 0
            size_range['Shoulder']['max'] = size_chart[size]['Shoulder']

        # Last size: the maximum is infinity.
        elif i == len(sizes) - 1:
            size_range['Chest']['min'] = size_chart[sizes[i-1]]['Chest']
            size_range['Chest']['max'] = float('inf')
            size_range['Shoulder']['min'] = size_chart[sizes[i-1]]['Shoulder']
            size_range['Shoulder']['max'] = float('inf')

        # Sizes in between.
        else:
            size_range['Chest']['min'] = size_chart[sizes[i-1]]['Chest']
            size_range['Chest']['max'] = size_chart[size]['Chest']
            size_range['Shoulder']['min'] = size_chart[sizes[i-1]]['Shoulder']
            size_range['Shoulder']['max'] = size_chart[size]['Shoulder']

        size_ranges[size] = size_range

    return size_ranges


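# Build the interval-based size charts.
female_size_ranges = create_size_ranges(female_size)
male_size_ranges = create_size_ranges(male_size)


# 2. Create a function: write a function that iterates through each person's
# measurements and compares them with the size chart. It should:
# - loop over every person in the dataset
# - determine the size for chest and for shoulder separately
# - check whether the two sizes are the same
# - count the matches and the conflicts
# - return the detailed matching information for each person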
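
Because `compute_size_percentile_measurements` stores the value at the lower percentile of each band, one alternative is to treat each chart value as the lower limit of its size. The variant below is only a sketch under that reading. The function name `create_size_ranges_from_lower_bounds` is hypothetical, and none of the results printed in this notebook were produced with it.

```python
def create_size_ranges_from_lower_bounds(size_chart):
    """Treat each chart value as the lower limit of its size (hypothetical variant)."""
    sizes = list(size_chart.keys())
    size_ranges = {}
    for i, size in enumerate(sizes):
        size_range = {}
        for measure in ("Chest", "Shoulder"):
            # The first size starts at 0 so that nobody falls below the chart.
            low = 0 if i == 0 else size_chart[size][measure]
            # Each size ends where the next one begins; the last one is open-ended.
            if i + 1 < len(sizes):
                high = size_chart[sizes[i + 1]][measure]
            else:
                high = float("inf")
            size_range[measure] = {"min": low, "max": high}
        size_ranges[size] = size_range
    return size_ranges

# Only the first three sizes of the female chart, to keep the example short.
chart = {
    "XS": {"Chest": 695, "Shoulder": 283},
    "S": {"Chest": 824, "Shoulder": 335},
    "M": {"Chest": 889, "Shoulder": 353},
}
ranges = create_size_ranges_from_lower_bounds(chart)
print(ranges["XS"]["Chest"])
print(ranges["S"]["Shoulder"])
```

The returned dictionary has the same `min`/`max` layout as the output of `create_size_ranges`, so it could be passed to the same downstream code.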

In [8]:
# Check, for every person, whether the chest size and the shoulder size agree.
def check_size_matching(data, size_ranges, chest_col='chestcircumference', shoulder_col='biacromialbreadth'):
    """
    Check whether each person's chest and shoulder measurements map to the same size.

    Parameters:
    data: the dataset (DataFrame)
    size_ranges: the interval-based size chart
    chest_col: name of the chest circumference column
    shoulder_col: name of the shoulder width column

    Returns:
    matched_count: number of people whose chest and shoulder fall in the same size
    conflicted_count: number of people whose chest and shoulder fall in different sizes
    size_details: detailed matching information for each person
    """
    matched_count = 0
    conflicted_count = 0
    size_details = []

    # Loop over every person in the dataset.
    for idx, row in data.iterrows():
        chest_measurement = row[chest_col]
        shoulder_measurement = row[shoulder_col]

        chest_size = None
        shoulder_size = None

        # Determine the size from the chest measurement.
        for size, ranges in size_ranges.items():
            if ranges['Chest']['min'] <= chest_measurement < ranges['Chest']['max']:
                chest_size = size
                break

        # Determine the size from the shoulder measurement.
        for size, ranges in size_ranges.items():
            if ranges['Shoulder']['min'] <= shoulder_measurement < ranges['Shoulder']['max']:
                shoulder_size = size
                break

        # Record the detailed matching information for this person.
        person_detail = {
            'index': idx,
            'chest_measurement': chest_measurement,
            'shoulder_measurement': shoulder_measurement,
            'chest_size': chest_size,
            'shoulder_size': shoulder_size,
            'match': chest_size == shoulder_size
        }

        size_details.append(person_detail)

        # 3. Count matches and conflicts: a person counts as a match when both
        # measurements give the same size, and as a conflict otherwise.
        if chest_size == shoulder_size:
            matched_count += 1
        else:
            conflicted_count += 1

    return matched_count, conflicted_count, size_details


# Your task is to get a clearer picture of
# how many individuals have matching sizes for both measurements
# and how many have different sizes.

# Test with the female dataset.
print('Female data analysis:')
female_matched, female_conflicted, female_details = check_size_matching(female, female_size_ranges, 'chestcircumference', 'biacromialbreadth')

# Print the number of matches, the number of conflicts, and the match rate.
print(f'matched: {female_matched}, conflicted: {female_conflicted}')
print(f'matched percentage: {female_matched / (female_matched + female_conflicted) * 100:.2f}%')

# Test with the male dataset.
print('Male data analysis:')
male_matched, male_conflicted, male_details = check_size_matching(male, male_size_ranges, 'chestcircumference', 'biacromialbreadth')

# Print the number of matches, the number of conflicts, and the match rate.
print(f'matched: {male_matched}, conflicted: {male_conflicted}')
print(f'matched percentage: {male_matched / (male_matched + male_conflicted) * 100:.2f}%')



Female data analysis:
matched: 490, conflicted: 1496
matched percentage: 24.67%
Male data analysis:
matched: 1252, conflicted: 2830
matched percentage: 30.67%
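
To see where the conflicts fall rather than only how many there are, the per-person details can be tabulated. The sketch below uses made-up records in the same shape as the entries of `size_details`; the records themselves are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Made-up records in the same shape as the entries of `size_details`.
toy_details = [
    {"index": 0, "chest_size": "S", "shoulder_size": "S", "match": True},
    {"index": 1, "chest_size": "S", "shoulder_size": "M", "match": False},
    {"index": 2, "chest_size": "M", "shoulder_size": "M", "match": True},
    {"index": 3, "chest_size": "L", "shoulder_size": "M", "match": False},
]
details = pd.DataFrame(toy_details)

# Rows: size by chest; columns: size by shoulder; cells: number of people.
table = pd.crosstab(details["chest_size"], details["shoulder_size"])
print(table)

# Share of people whose two sizes agree.
matched_share = details["match"].mean() * 100
print(f"matched percentage: {matched_share:.2f}%")
```

Cells where the row and column labels are the same are the matches; every other cell shows which pair of sizes is in conflict.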
