# Домашнее задание к лекции "Базовые понятия статистики"

## Обязательная часть

Будем осуществлять работу с непростым [набором данных](https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/statistics_basics/horse_data.csv) о состоянии здоровья лошадей, испытывающих кишечные колики. 

### Задание 1. Базовое изучение

Изучить представленный набор данных на основе [описания его столбцов](https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/statistics_basics/horse_data.names) и выбрать 8 столбцов для дальнейшего изучения (среди них должны быть как числовые, так и категориальные). Провести расчет базовых метрик для них, кратко описать результаты.

### Задание 2. Работа с выбросами

В выбранных числовых столбцах найти выбросы, выдвинуть гипотезы об их причинах и проинтерпретировать результаты. Принять и обосновать решение о дальнейшей работе с ними.

### Задание 3. Работа с пропусками

Рассчитать количество пропусков для всех выбранных столбцов. Принять и обосновать решение о методе работы с пропусками по каждому столбцу, сформировать датафрейм, в котором пропуски будут отсутствовать.

# 1. Basic study

In [1]:
import pandas as pd

In [2]:
import requests
from bs4 import BeautifulSoup
import regex as re

In [3]:
# get dataframe description to extract column names
req = requests.get('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/statistics_basics/horse_data.names')
soup = BeautifulSoup(req.text, 'html.parser')
info = str(soup)
soup

1. TItle: Horse Colic database

2. Source Information
   -- Creators: Mary McLeish &amp; Matt Cecile
	  	Department of Computer Science
		University of Guelph
		Guelph, Ontario, Canada N1G 2W1
		mdmcleish@water.waterloo.edu
   -- Donor:    Will Taylor (taylor@pluto.arc.nasa.gov)
   -- Date:     8/6/89

3. Past Usage:
   -- Unknown

4. Relevant Information:

   -- 2 data files
      -- horse-colic.data: 300 training instances
      -- horse-colic.test: 68 test instances
   -- Possible class attributes: 24 (whether lesion is surgical)
     -- others include: 23, 25, 26, and 27
   -- Many Data types: (continuous, discrete, and nominal)

5. Number of Instances: 368 (300 for training, 68 for testing)

6. Number of attributes: 28

7. Attribute Information:

  1:  surgery?
          1 = Yes, it had surgery
          2 = It was treated without surgery

  2:  Age
          1 = Adult horse
          2 = Young (&lt; 6 months)

  3:  Hospital Number
          - numeric id
          - the case numb

In [4]:
# get the list of the column names
header = re.findall(r'\d{1,2}\:\s+(.+)\n',info)
print(header)
print(len(header))

['surgery?', 'Age', 'Hospital Number', 'rectal temperature', 'pulse', 'respiratory rate', 'temperature of extremities', 'peripheral pulse', 'mucous membranes', 'capillary refill time', "pain - a subjective judgement of the horse's pain level", 'peristalsis', 'abdominal distension', 'nasogastric tube', 'nasogastric reflux', 'nasogastric reflux PH', 'rectal examination - feces', 'abdomen', 'packed cell volume', 'total protein', 'abdominocentesis appearance', 'abdomcentesis total protein', 'outcome', 'surgical lesion?', 'type of lesion', 'cp_data']
26


In [5]:
# columns  25-26 are missing (described along with 27 - same names) - let's add artificially
header.insert(-1, header[-2]+'1')
header.insert(-1, header[-3]+'2')
print(header)
len(header)

['surgery?', 'Age', 'Hospital Number', 'rectal temperature', 'pulse', 'respiratory rate', 'temperature of extremities', 'peripheral pulse', 'mucous membranes', 'capillary refill time', "pain - a subjective judgement of the horse's pain level", 'peristalsis', 'abdominal distension', 'nasogastric tube', 'nasogastric reflux', 'nasogastric reflux PH', 'rectal examination - feces', 'abdomen', 'packed cell volume', 'total protein', 'abdominocentesis appearance', 'abdomcentesis total protein', 'outcome', 'surgical lesion?', 'type of lesion', 'type of lesion1', 'type of lesion2', 'cp_data']


28

In [6]:
# Read file and name the columns; apparently '?' stands for missing values - replacing with NaN
df = pd.read_csv('https://raw.githubusercontent.com/obulygin/pyda_homeworks/master/statistics_basics/horse_data.csv', names=header, na_values = '?')
df.head()

Unnamed: 0,surgery?,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,...,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion?,type of lesion,type of lesion1,type of lesion2,cp_data
0,2.0,1,530101,38.5,66.0,28.0,3.0,3.0,,2.0,...,45.0,8.4,,,2.0,2,11300,0,0,2
1,1.0,1,534817,39.2,88.0,20.0,,,4.0,1.0,...,50.0,85.0,2.0,2.0,3.0,2,2208,0,0,2
2,2.0,1,530334,38.3,40.0,24.0,1.0,1.0,3.0,1.0,...,33.0,6.7,,,1.0,2,0,0,0,1
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
4,2.0,1,530255,37.3,104.0,35.0,,,6.0,2.0,...,74.0,7.4,,,2.0,2,4300,0,0,2


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 28 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   surgery?                                                 299 non-null    float64
 1   Age                                                      300 non-null    int64  
 2   Hospital Number                                          300 non-null    int64  
 3   rectal temperature                                       240 non-null    float64
 4   pulse                                                    276 non-null    float64
 5   respiratory rate                                         242 non-null    float64
 6   temperature of extremities                               244 non-null    float64
 7   peripheral pulse                                         231 non-null    float64
 8   mucous membranes              

In [8]:
df.describe()

Unnamed: 0,surgery?,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,...,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion?,type of lesion,type of lesion1,type of lesion2,cp_data
count,299.0,300.0,300.0,240.0,276.0,242.0,244.0,231.0,253.0,268.0,...,271.0,267.0,135.0,102.0,299.0,300.0,300.0,300.0,300.0,300.0
mean,1.397993,1.64,1085889.0,38.167917,71.913043,30.417355,2.348361,2.017316,2.853755,1.30597,...,46.295203,24.456929,2.037037,3.019608,1.551839,1.363333,3657.88,90.226667,7.363333,1.67
std,0.490305,2.173972,1529801.0,0.732289,28.630557,17.642231,1.045054,1.042428,1.620294,0.477629,...,10.419335,27.475009,0.804905,1.968567,0.737187,0.481763,5399.513513,649.569234,127.536674,0.470998
min,1.0,1.0,518476.0,35.4,30.0,8.0,1.0,1.0,1.0,1.0,...,23.0,3.3,1.0,0.1,1.0,1.0,0.0,0.0,0.0,1.0
25%,1.0,1.0,528904.0,37.8,48.0,18.5,1.0,1.0,1.0,1.0,...,38.0,6.5,1.0,2.0,1.0,1.0,2111.75,0.0,0.0,1.0
50%,1.0,1.0,530305.5,38.2,64.0,24.5,3.0,2.0,3.0,1.0,...,45.0,7.5,2.0,2.25,1.0,1.0,2673.5,0.0,0.0,2.0
75%,2.0,1.0,534727.5,38.5,88.0,36.0,3.0,3.0,4.0,2.0,...,52.0,57.0,3.0,3.9,2.0,2.0,3209.0,0.0,0.0,2.0
max,2.0,9.0,5305629.0,40.8,184.0,96.0,4.0,4.0,6.0,3.0,...,75.0,89.0,3.0,10.1,3.0,2.0,41110.0,7111.0,2209.0,2.0


In [9]:
def describe_more (measure):
    """calculates and prints  mode, standard deviation, dispersion and number of missing values per column in df
    """
    print(measure.describe())
    print(f'Mode: {measure.mode()[0]}')
    print(f'Standard deviation: {measure.std()}') # стандартное отклонение
    print(f'Dispersion: {measure.var()}') # дисперсия 
    print(f'Number of missing records: {len(df)-measure.count()} ({(len(df)-measure.count())*100/len(df)}%)')

In [10]:
# list of columns for analisys
analysis_list = ['rectal temperature',
'pulse',
'respiratory rate',
'temperature of extremities',
'peripheral pulse',
'mucous membranes',
'capillary refill time',
'packed cell volume']

In [11]:
# quantitative data requires diffrent approach:
quantitative_columns = ['peripheral pulse',
                        'mucous membranes',
                        'capillary refill time']

In [34]:
def share_in_total(measure):
    num_per_rate = df.groupby(by=[measure]).agg({measure: 'count'})
    num_sum = df.agg({measure: 'count'})
    share = num_per_rate.groupby(level=0).apply(lambda x: 100 * x / num_sum).rename(columns= {measure:'share in total, %'})
    return share

In [35]:
# print information for each of the selected columns
for column in analysis_list:
    print(column)
    measure=df[column]
    #print(measure.describe())
    describe_more(measure)
    print('\n')

rectal temperature
count    300.000000
mean      38.160167
std        0.656105
min       35.400000
25%       37.900000
50%       38.100000
75%       38.500000
max       40.800000
Name: rectal temperature, dtype: float64
Mode: 38.1
Standard deviation: 0.6561045772409132
Dispersion: 0.4304732162764775
Number of missing records: 0 (0.0%)


pulse
count    300.000000
mean      71.640000
std       27.818278
min       30.000000
25%       48.000000
50%       64.000000
75%       88.000000
max      184.000000
Name: pulse, dtype: float64
Mode: 60.0
Standard deviation: 27.818277959441758
Dispersion: 773.8565886287631
Number of missing records: 0 (0.0%)


respiratory rate
count    300.000000
mean      29.273333
std       16.010979
min        8.000000
25%       20.000000
50%       24.500000
75%       34.250000
max       96.000000
Name: respiratory rate, dtype: float64
Mode: 24.5
Standard deviation: 16.0109793711576
Dispersion: 256.3514604236342
Number of missing records: 0 (0.0%)


temperature of ex

In [36]:
# print information for each of the selected columns
for column in quantitative_columns:
    print(column)
    print(share_in_total(column))
    print('\n')

peripheral pulse
                  share in total, %
peripheral pulse                   
1.0                       48.000000
2.0                        1.666667
3.0                       47.666667
4.0                        2.666667


mucous membranes
                  share in total, %
mucous membranes                   
1.0                       26.333333
2.0                       14.333333
3.0                       30.000000
4.0                       14.333333
5.0                        8.333333
6.0                        6.666667


capillary refill time
                       share in total, %
capillary refill time                   
1.0                            71.666667
2.0                            28.333333




## Description from the file with summary: 

#### 4:  rectal temperature
          - linear
          - in degrees celsius.
          - An elevated temp may occur due to infection.
          - temperature may be reduced when the animal is in late shock
          - normal temp is 37.8
          - this parameter will usually change as the problem progresses
               eg. may start out normal, then become elevated because of
                   the lesion, passing back through the normal range as the
                   horse goes into shock
#### Summary: 
Quantitative, continuous.
Based on min and max - all values are realistic  - outliers shouldn't be excluded. 
Percentile 25 complies with normal temp - in 75% increased temp is observed, as expected since those are sick horses. 
Both Mode and median also represent increased temperature. 
Standard deviation less than 1 degree - low. Values are close to mean, but it's xplained by the narrow range of temperatures in Celcius degrees possible for living mamal.
60 (20.0%) records are missing. The missing values can be filled in based on temperature of extremities (hot extremities should correlate with an elevated rectal temp) and type of leasion (25) if second number indicates inflamation (3)
  
####  5:  pulse
          - linear
          - the heart rate in beats per minute
          - is a reflection of the heart condition: 30 -40 is normal for adults
          - rare to have a lower than normal rate although athletic horses
            may have a rate of 20-25
          - animals with painful lesions or suffering from circulatory shock
            may have an elevated heart rate
#### Summary: 
Quantitative, continuous.
Min (30) complies with the normal, percntile 25 (48) is already slightly above normal. The majority of records indicate increased pulse, which is possible for suffering animal. It is difficult to judge without medical knowledge whether maximum of 184 s a plausible value for pulse, but it's not entirely unrealistic - needs further investigation.
Mean (71.9) and median (64) indcate that average pulse is increased double to what's concidered normal; though mode (48) is close to normal - which indicates that the upper outliers influence the average. 
Standard deviation (26.8) is comparable to the minimum (30) - indicates that values are spread in a wide range.
24 (8.0%) records are missing. The missing values can be filled in based on pain and peripheral pulse.

 
####  6:  respiratory rate
          - linear
          - normal rate is 8 to 10
          - usefulness is doubtful due to the great fluctuations

  
#### Summary:
Quantitative, continuous.
Min (8) complies with the normal, while percntile 25 (18) is already above normal. The majority of records indicate increased respiratory rate. The maximum of 96 seems plausible.
Mean (30.4), median (24.5) and mode (20) again indcate that average value is inflenced by high extremities on the upper side of the range. 
Standard deviation (17.6) is comparable to mode (20) - indicates that values are spread in a wide range.
58 (19%) records are missing. The missing values can be filled with the Median since 'usefulness is doubtful due to the great fluctuations'.

####  7:  temperature of extremities
          - a subjective indication of peripheral circulation
          - possible values:
               1 = Normal
               2 = Warm
               3 = Cool
               4 = Cold
          - cool to cold extremities indicate possible shock
          - hot extremities should correlate with an elevated rectal temp.
  
#### Summary:
Qualitative, ordinal.
The mode (3) - cool extremities - is the most common observation; indicate possible shock. 
56 (18.7%) records are missing. The missing values can be filled based on rectal temp (in case of elevated rectal temp). 
 
####  8:  peripheral pulse
          - subjective
          - possible values are:
               1 = normal
               2 = increased
               3 = reduced
               4 = absent
          - normal or increased p.p. are indicative of adequate circulation
            while reduced or absent indicate poor perfusion

peripheral pulse
                  share in total, %
peripheral pulse                   
1.0                       49.783550
2.0                        2.164502
3.0                       44.588745
4.0                        3.463203


#### Summary:
Qualitative, ordinal.
The mode (1) - normal - is the most common observation; indicates possible shock. 
About half observations  are 1 = normal, indicative of adequate circulation, and slightly less than half 3 = reduced, indicative of poor perfusion.
69 (23.0%) records are missing. The missing values can be filled in based on capillary refill time. 


####  9:  mucous membranes
          - a subjective measurement of colour
          - possible values are:
               1 = normal pink
               2 = bright pink
               3 = pale pink
               4 = pale cyanotic
               5 = bright red / injected
               6 = dark cyanotic
          - 1 and 2 probably indicate a normal or slightly increased
            circulation
          - 3 may occur in early shock
          - 4 and 6 are indicative of serious circulatory compromise
          - 5 is more indicative of a septicemia

mucous membranes
                  share in total, %
mucous membranes                   
1.0                       31.225296
2.0                       11.857708
3.0                       22.924901
4.0                       16.205534
5.0                        9.881423
6.0                        7.905138


#### Summary:
Qualitative, ordinal.
The mode (1) - normal pink - is the most common observation; indicates a normal or slightly increased circulation. 
43% (options 1 and 2) probably indicate a normal or slightly increased circulation; 22,9% (option 3) may occur in early shock; 24% (options 4 and 6) are indicative of serious circulatory compromise; 9.9% (option 5) is more indicative of a septicemia
47 (15.6%) records are missing. The missing values can be filled in based on capillary refill time or temperature of extremities. 

#### 10: capillary refill time
          - a clinical judgement. The longer the refill, the poorer the
            circulation
          - possible values
               1 = < 3 seconds
               2 = >= 3 seconds
 
                       share in total, %
capillary refill time                   
1.0                            70.149254
2.0                            29.104478
3.0                             0.746269

#### Summary:
Qualitative, ordinal.
The majority of records (70%) as well as mode are option 1 (< 3 seconds) - implies short refil time, which means good circulation; while only 29% are option 2(>= 3 seconds), meaining poorer circlation. 
There is also option 3 which is not mentioned in the file description (less than 1%). I would consider this aт input error and replace with option 2 (since it probably means an even longer refill time)  
32 (10.6%) records are missing. The missing values can be filled based on mucous membranes.

####  19: packed cell volume
          - linear
          - the # of red cells by volume in the blood
          - normal range is 30 to 50. The level rises as the circulation
            becomes compromised or as the animal becomes dehydrated.
            
 #### Summary: 
Quantitative, continuous.
Min (23) is below normal (no information on possible reasons), values within  25 and 75 percnetiles (38-52) are close to the normal range; mode (37) is also within the normal range. Max (75) indicates increased packed cell volume, but majority of records are within the norm. 
It is difficult to judge without medical knowledge whether maximum of 75 is a plausible value, but it's not entirely unrealistic - needs further investigation.
Standard deviation (10.4) is comparable to half of the norlmal range ((50-30)/2).
29 (9.7%) records are missing. The missing values can be filled in based on peripheral pulse and capillary refill time, as it's related with circulation properties.         


# 2. Outliers

In [16]:
def outliers(measure):
    """ calculates lower and upper bounds for outliers 
        and shows all rows in df with outliers in the particular column"""
    q1 = measure.quantile(0.25)
    q3 = measure.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q1 + (1.5 * iqr)
    #return [lower_bound, upper_bound]
    return df.loc[(measure < lower_bound) | (measure > upper_bound)]


In [17]:
outliers(df['rectal temperature'])

Unnamed: 0,surgery?,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,...,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion?,type of lesion,type of lesion1,type of lesion2,cp_data
1,1.0,1,534817,39.2,88.0,20.0,,,4.0,1.0,...,50.0,85.0,2.0,2.0,3.0,2,2208,0,0,2
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
11,2.0,1,527927,39.1,72.0,52.0,2.0,,2.0,1.0,...,50.0,7.8,,,1.0,1,2111,0,0,2
19,2.0,1,532110,39.4,110.0,35.0,4.0,3.0,6.0,,...,55.0,8.7,,,1.0,2,0,0,0,2
20,1.0,1,530157,39.9,72.0,60.0,1.0,1.0,5.0,2.0,...,46.0,6.1,2.0,,1.0,1,2111,0,0,2
39,1.0,9,5277409,39.2,146.0,96.0,,,,,...,,,,,2.0,1,2113,0,0,2
41,2.0,9,5288249,39.0,150.0,72.0,,,,,...,47.0,8.5,,0.1,1.0,1,9400,0,0,1
44,1.0,1,535407,35.4,140.0,24.0,3.0,3.0,4.0,2.0,...,57.0,69.0,3.0,2.0,3.0,1,3205,0,0,2
48,1.0,1,528890,38.9,80.0,44.0,3.0,3.0,3.0,2.0,...,54.0,6.5,3.0,,2.0,1,7111,0,0,2
54,2.0,1,529461,40.3,114.0,36.0,3.0,3.0,1.0,2.0,...,57.0,8.1,3.0,4.5,3.0,1,7400,0,0,1


In [18]:
outliers(df['pulse'])

Unnamed: 0,surgery?,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,...,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion?,type of lesion,type of lesion1,type of lesion2,cp_data
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
16,1.0,9,5301219,,128.0,36.0,3.0,3.0,4.0,2.0,...,53.0,7.8,3.0,4.7,2.0,2,1400,0,0,1
19,2.0,1,532110,39.4,110.0,35.0,4.0,3.0,6.0,,...,55.0,8.7,,,1.0,2,0,0,0,2
23,1.0,9,534998,38.3,130.0,60.0,,3.0,,1.0,...,50.0,70.0,,,1.0,1,3111,0,0,2
36,2.0,1,529493,38.3,112.0,16.0,,3.0,5.0,2.0,...,51.0,6.0,2.0,1.0,3.0,2,5205,0,0,1
39,1.0,9,5277409,39.2,146.0,96.0,,,,,...,,,,,2.0,1,2113,0,0,2
41,2.0,9,5288249,39.0,150.0,72.0,,,,,...,47.0,8.5,,0.1,1.0,1,9400,0,0,1
43,1.0,1,534069,,120.0,,3.0,4.0,4.0,1.0,...,52.0,67.0,2.0,2.0,3.0,1,3205,0,0,2
44,1.0,1,535407,35.4,140.0,24.0,3.0,3.0,4.0,2.0,...,57.0,69.0,3.0,2.0,3.0,1,3205,0,0,2
45,2.0,1,529827,,120.0,,4.0,3.0,4.0,2.0,...,60.0,6.5,3.0,,2.0,1,3205,0,0,2


In [19]:
outliers(df['respiratory rate'])

Unnamed: 0,surgery?,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,...,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion?,type of lesion,type of lesion1,type of lesion2,cp_data
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
11,2.0,1,527927,39.1,72.0,52.0,2.0,,2.0,1.0,...,50.0,7.8,,,1.0,1,2111,0,0,2
15,1.0,1,530233,37.6,96.0,48.0,3.0,1.0,4.0,1.0,...,45.0,6.8,,,2.0,1,3207,0,0,2
20,1.0,1,530157,39.9,72.0,60.0,1.0,1.0,5.0,2.0,...,46.0,6.1,2.0,,1.0,1,2111,0,0,2
23,1.0,9,534998,38.3,130.0,60.0,,3.0,,1.0,...,50.0,70.0,,,1.0,1,3111,0,0,2
39,1.0,9,5277409,39.2,146.0,96.0,,,,,...,,,,,2.0,1,2113,0,0,2
41,2.0,9,5288249,39.0,150.0,72.0,,,,,...,47.0,8.5,,0.1,1.0,1,9400,0,0,1
49,2.0,1,529642,37.2,84.0,48.0,3.0,3.0,5.0,2.0,...,73.0,5.5,2.0,4.1,2.0,2,4300,0,0,1
82,1.0,9,5290759,38.1,100.0,80.0,3.0,1.0,2.0,1.0,...,36.0,5.7,,,1.0,1,3111,0,0,2
84,1.0,1,529849,37.8,60.0,80.0,1.0,3.0,2.0,2.0,...,40.0,4.5,2.0,,1.0,1,5206,0,0,1


In [20]:
df_pcv = outliers(df['packed cell volume'])
df_pcv[df_pcv.columns[3:20]]

Unnamed: 0,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,pain - a subjective judgement of the horse's pain level,peristalsis,abdominal distension,nasogastric tube,nasogastric reflux,nasogastric reflux PH,rectal examination - feces,abdomen,packed cell volume,total protein
4,37.3,104.0,35.0,,,6.0,2.0,,,,,,,,,74.0,7.4
30,37.7,96.0,30.0,3.0,3.0,4.0,2.0,5.0,4.0,4.0,3.0,2.0,4.0,4.0,5.0,66.0,7.5
35,,104.0,24.0,4.0,3.0,3.0,2.0,4.0,4.0,3.0,,3.0,,,2.0,73.0,8.4
40,,88.0,,3.0,3.0,6.0,2.0,5.0,3.0,3.0,1.0,3.0,,4.0,5.0,63.0,6.5
45,,120.0,,4.0,3.0,4.0,2.0,5.0,4.0,4.0,1.0,1.0,,4.0,5.0,60.0,6.5
46,37.9,60.0,15.0,3.0,,4.0,2.0,5.0,4.0,4.0,2.0,2.0,,4.0,5.0,65.0,7.5
49,37.2,84.0,48.0,3.0,3.0,5.0,2.0,4.0,1.0,2.0,1.0,2.0,,2.0,1.0,73.0,5.5
59,,96.0,,3.0,3.0,3.0,2.0,5.0,4.0,4.0,1.0,2.0,,4.0,5.0,60.0,
62,37.8,88.0,22.0,2.0,1.0,2.0,1.0,3.0,,,2.0,,,4.0,,64.0,8.0
63,38.2,130.0,16.0,4.0,3.0,4.0,2.0,2.0,4.0,4.0,1.0,1.0,,,,65.0,82.0


## Outliers summary

Outliers for all 4 Quantitative measure ('rectal temperature', 'pulse', 'respiratory rate', 'packed cell volume') seem  plausible based o nthe description of the data - will keep them.

Though the qulitative measure 'capillary refill time' contains value '3' not present in description (see above) - we will replace it with '2'.

In [21]:
df['capillary refill time'] = df['capillary refill time'].replace({3: 2})

In [22]:
# check
share_in_total('capillary refill time')

Unnamed: 0_level_0,"share in total, %"
capillary refill time,Unnamed: 1_level_1
1.0,70.149254
2.0,29.850746


# 3. Missing values

In [23]:
def missing (measure):
        print(f'{len(df)-measure.count()} ({(len(df)-measure.count())*100/len(df)}%)')

In [24]:
for column in analysis_list:
    print(column)
    measure=df[column]
    missing(measure)
    print('\n')

rectal temperature
60 (20.0%)


pulse
24 (8.0%)


respiratory rate
58 (19.333333333333332%)


temperature of extremities
56 (18.666666666666668%)


peripheral pulse
69 (23.0%)


mucous membranes
47 (15.666666666666666%)


capillary refill time
32 (10.666666666666666%)


packed cell volume
29 (9.666666666666666%)




### Approach to replacing missing values: 
#### 4:  rectal temperature
60 (20.0%) records are missing. The missing values can be filled in with Median based on temperature of extremities (hot extremities should correlate with an elevated rectal temp) and type of leasion (25) if second number indicates inflamation (3)
  
####  5:  pulse
24 (8.0%) records are missing. The missing values can be filled in with Median based on peripheral pulse.

 
####  6:  respiratory rate
58 (19%) records are missing. The missing values can be filled with Median since 'usefulness is doubtful due to the great fluctuations'.

####  7:  temperature of extremities
56 (18.7%) records are missing. The missing values can be filled with Mode based on rectal temp (in case of elevated rectal temp). 
 
####  8:  peripheral pulse
69 (23.0%) records are missing. The missing values can be filled in with Mode based on capillary refill time. 


####  9:  mucous membranes
47 (15.6%) records are missing. The missing values can be filled in with Mode based on temperature of extremities. 

#### 10: capillary refill time
32 (10.6%) records are missing. The missing values can be filled in with Mode based on mucous membranes.

####  19: packed cell volume
29 (9.7%) records are missing. The missing values can be filled in with Median based on peripheral pulse and capillary refill time, as it's related with circulation properties.         

In [25]:
#create new df for adding missing values + group rectal temp to use in calcualtion 
df_new = pd.DataFrame(df) #(df[analysis_list])
df_new['rect_temp_group'] = df_new['rectal temperature'].apply([lambda x: 'Normal temp' if x<= 38 else 'Elevated temp'])
df_new.head()

Unnamed: 0,surgery?,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,...,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion?,type of lesion,type of lesion1,type of lesion2,cp_data,rect_temp_group
0,2.0,1,530101,38.5,66.0,28.0,3.0,3.0,,2.0,...,8.4,,,2.0,2,11300,0,0,2,Elevated temp
1,1.0,1,534817,39.2,88.0,20.0,,,4.0,1.0,...,85.0,2.0,2.0,3.0,2,2208,0,0,2,Elevated temp
2,2.0,1,530334,38.3,40.0,24.0,1.0,1.0,3.0,1.0,...,6.7,,,1.0,2,0,0,0,1,Elevated temp
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,7.2,3.0,5.3,2.0,1,2208,0,0,1,Elevated temp
4,2.0,1,530255,37.3,104.0,35.0,,,6.0,2.0,...,7.4,,,2.0,2,4300,0,0,2,Normal temp


In [26]:
# rectal temperature
df_new['rectal temperature'].fillna(df_new.groupby('temperature of extremities', dropna=False)['rectal temperature'].transform('median'), inplace=True)

In [27]:
# pulse
df_new['pulse'].fillna(df_new.groupby(['peripheral pulse'], dropna=False)['pulse'].transform('median'), inplace=True)

In [28]:
# respiratory rate
df_new['respiratory rate'].fillna(df_new['respiratory rate'].median(), inplace=True)

In [29]:
# temperature of extremities
df_new['temperature of extremities'].fillna(df_new.groupby('rect_temp_group', dropna=False)['temperature of extremities'].transform('median'), inplace=True)

In [30]:
# peripheral pulse
df_new['peripheral pulse'].fillna(df_new.groupby('capillary refill time', dropna=False)['peripheral pulse'].transform('median'), inplace=True)

In [31]:
# mucous membranes
df_new['mucous membranes'].fillna(df_new.groupby('temperature of extremities', dropna=False)['mucous membranes'].transform('median'), inplace=True)

In [32]:
# capillary refill time
df_new['capillary refill time'].fillna(df_new.groupby('mucous membranes', dropna=False)['capillary refill time'].transform('median'), inplace=True)

In [33]:
# packed cell volume
df_new['packed cell volume'].fillna(df_new.groupby(['peripheral pulse','capillary refill time'], dropna=False)['packed cell volume'].transform('median'), inplace=True)

In [277]:
# check columns for NAs
for column in analysis_list:
    print(column)
    measure=df_new[column]
    missing(measure)
    print('\n')

rectal temperature
0 (0.0%)


pulse
0 (0.0%)


respiratory rate
0 (0.0%)


temperature of extremities
0 (0.0%)


peripheral pulse
0 (0.0%)


mucous membranes
0 (0.0%)


capillary refill time
0 (0.0%)


packed cell volume
0 (0.0%)


