# Exploratory data analysis before preprocessing
The aim of this notebook is to identify problems with individual features. It does not explore relationships between features.

Goals:
- Describe every feature
- Identify problems for every feature
- Collect ideas on how to collapse data, impute missing data or even drop data

TODO: 
- How to handle rare categories in the feature "race", e.g. "Native Hawaiian or Other Pacific Islander" (3 observations) or "American Indian or Alaska Native" (1 observation)
- Suggest renaming the feature "loc", as "loc" is also the name of a pandas indexer (DataFrame.loc)
- Inquire about the purpose of the feature "other"
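
The renaming item in the TODO list can be illustrated with a small toy frame (values made up, not taken from the dataset). Attribute access on a column called "loc" resolves to the pandas indexer, so the column is only reachable with bracket syntax. The replacement name "location" is only a suggestion.

```python
import pandas as pd

# toy data, not taken from the uveitis dataset
toy = pd.DataFrame({"id": [1, 2], "loc": ["anterior", "posterior"]})

# attribute access returns the .loc indexer, not the column
attr = toy.loc
print(type(attr).__name__)

# bracket access still reaches the column
print(toy["loc"].tolist())

# renaming removes the ambiguity ("location" is a hypothetical name)
renamed = toy.rename(columns={"loc": "location"})
print(renamed.location.tolist())
```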

In [1]:
# imports (install missing libraries by running "!pip install 'libraryname'" in a cell)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import missingno as msno 
sys.path.insert(0, '../preprocessing/')
import pipe

# set plot style and size:
sns.set(rc={'figure.figsize':(9,6)})
sns.set_style("ticks")
f'pandas version: {pd.__version__}'

'pandas version: 1.0.5'

## Helper functions

In [2]:
def countplot(df, column, title, subtitle, xlabel, ylabel, xtick_rotation = None):
    g = sns.countplot(x=df[column])

    # Title, Subtitle and Axis
    g.text(x=0.5, 
            y=1.06, 
            s=title, 
            fontsize=10, weight='bold', ha='center', va='bottom', transform=g.transAxes)
    g.text(x=0.5, 
            y=1.01, 
            s=subtitle, 
            fontsize=10, alpha=0.75, ha='center', va='bottom', transform=g.transAxes)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    if xtick_rotation is not None:
        plt.xticks(rotation=xtick_rotation)
    plt.show()

In [4]:
# import dataframe
df = pd.read_excel("../data/uveitis_data.xlsx")
assert len(df) >= 1075, "Data is not complete"

# rename columns
df = pipe.rename(df, "../data/col_names&data_type.xlsx")

FileNotFoundError: [Errno 2] No such file or directory: '../data/col_names&data_type1.xlsx'

## Missing Values

In [None]:
msno.matrix(df, sparkline=False);

In [None]:
df.head()

In [None]:
df.columns.tolist()

## id
Unique identifier of an observation, i.e. a patient.

In [None]:
assert len(df['id'].unique()) == len(df['id']), "Id is not unique"

In [None]:
df.id.sample(3)

## gender
A categorical variable that defines the gender of a patient

In [None]:
count = df.gender.value_counts()
n_male, n_female = count['Male'], count['Female']
countplot(df, 'gender', f'Distribution of the feature "gender"', f'Male: {n_male}, Female: {n_female}', f'Gender', f'Nr. of Patients')

## race
Describes the ethnicity of the patient. This information is provided by the patients themselves.

In [None]:
df.race.value_counts()

In [None]:
count = df.race.value_counts()
g = sns.countplot(y=df.race)

# Title, Subtitle and Axis
g.text(x=0.5, 
        y=1.06, 
        s=f'Distribution of the feature "race"', 
        fontsize=10, weight='bold', ha='center', va='bottom', transform=g.transAxes)
g.text(x=0.5, 
        y=1.01, 
        s=f'This information is provided by the patient', 
        fontsize=10, alpha=0.75, ha='center', va='bottom', transform=g.transAxes)
g.yaxis.tick_right()
# g.yaxis.set_label_position("right")
plt.ylabel(f'Race')
plt.xlabel(f'Nr. of Patients')
plt.show()

The plot shows both the category "Unknown Race" (74 observations) and "Race or Ethnic Group Data Not Provided by Source" (1 observation). We suggest merging both categories and treating them as missing values. In addition, some categories hardly appear in the dataset, for example "Native Hawaiian or Other Pacific Islander" (3 observations) and "American Indian or Alaska Native" (1 observation).
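
A minimal sketch of the suggested merge, using a toy series. The two "unknown" labels are the ones named above; the remaining labels are only illustrative.

```python
import pandas as pd

# toy data; "White" and "Asian" are illustrative labels
race = pd.Series([
    "White",
    "Unknown Race",
    "Race or Ethnic Group Data Not Provided by Source",
    "Asian",
])

# both labels carry no information about the patient
missing_labels = [
    "Unknown Race",
    "Race or Ethnic Group Data Not Provided by Source",
]

# mask() replaces the matching entries with NaN
race_clean = race.mask(race.isin(missing_labels))
print(race_clean.value_counts(dropna=False))
```

Whether the rare categories should be merged as well is still an open point (see the TODO list).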

## loc
loc ("Location") describes the location of the inflammation in the eye.

In [None]:
df['loc'] = df['loc'].str.lower().str.strip()
# replace() returns a new Series, so the result has to be assigned back
df['loc'] = df['loc'].replace({'pan':'panuveitis'})
df['loc'] = df['loc'].astype('category')
df['loc'].value_counts(dropna=False)

In [None]:
countplot(df, 'loc', f'Distribution of the feature "loc" (location)', 'Describes the location of the inflammation in the eye', f'Location', f'Nr. of Patients')

In [None]:
df.ac_abn_od_cells.value_counts()

In addition to the categories anterior (Latin: in front), posterior (Latin: behind) and intermediate, there is the category panuveitis, which is an inflammation of the whole uveal tract as well as the retina and the vitreous humor (vitreous body) (see Bansal et al., 2010). The category "scleritis" refers to inflammation of episcleral and scleral tissue (see Alan et al., n.d.).

## cat
cat ("Category") describes the origin of the inflammation. This can be, for example, systemic, infectious or idiopathic. This feature is based on the results of laboratory tests and has therefore been recorded retrospectively. It would, however, be interesting to make predictions about a patient without relying on manually aggregated information. We therefore recommend trying to find a model that does not use this feature.

In [None]:
df.cat = df.cat.str.lower().str.strip().astype('category')
df.cat.value_counts(dropna=False)

In [None]:
g = sns.countplot(y="cat", data=df, order = df['cat'].value_counts(dropna=False).index )

# Title, Subtitle and Axis
g.text(x=0.5, 
        y=1.06, 
        s=f'Distribution of the feature "cat" (Category)', 
        fontsize=10, weight='bold', ha='center', va='bottom', transform=g.transAxes)
g.text(x=0.5, 
        y=1.01, 
        s= 'Describes the origin of the inflammation in the eye', 
        fontsize=10, alpha=0.75, ha='center', va='bottom', transform=g.transAxes)
g.yaxis.tick_right()
# g.yaxis.set_label_position("right")
plt.ylabel(f'Category')
plt.xlabel(f'Nr. of Patients')
plt.show()

More than 400 of the observations are idiopathic, i.e. the cause is unknown. The category panuveitis is also frequent, followed by wds (white dot syndromes) and systemic or infectious origin. Categories containing the term "masquerade" refer to a form of pseudo-uveitis (see Smith et al., 1986): the inflammation appears to be uveitis, but it is not. There is also one observation of scleritis and one labeled "not_uveitis". We advise combining all forms of pseudo-uveitis and treating the categories with only one observation as missing values.
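
The two recommendations above could be sketched as follows. The data is a toy series, and the exact spelling of the "masquerade" labels is an assumption; only the substring "masquerade" is taken from the text.

```python
import pandas as pd

# toy data; the "masquerade_*" spellings are assumed
cat = pd.Series([
    "idiopathic", "idiopathic", "infectious", "infectious",
    "masquerade_neoplastic", "masquerade_other",
    "scleritis", "not_uveitis",
])

# 1) collapse every label containing "masquerade" into one category
is_masq = cat.str.contains("masquerade", na=False)
cat_clean = cat.mask(is_masq, "masquerade")

# 2) set categories that occur only once to missing (NaN)
counts = cat_clean.value_counts()
singletons = counts[counts == 1].index
cat_clean = cat_clean.mask(cat_clean.isin(singletons))

print(cat_clean.value_counts(dropna=False))
```

The order matters: collapsing first keeps rare pseudo-uveitis variants from being discarded as singletons.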

## other_

In [None]:
# the percent format avoids floating-point artifacts from round(...) * 100
print(f'{df.other_.isna().mean():.1%} missing values')

In [None]:
df.other_.unique()

## ehr_diagnosis
The EHR diagnosis is an electronically transmitted diagnosis, usually given beforehand by another doctor who does not know the lab results or the final diagnosis. This feature contains many different categories (533 unique values).

In [None]:
# the percent format avoids floating-point artifacts from round(...) * 100
print(f'{df.ehr_diagnosis.isna().mean():.1%} missing values')

In [None]:
print(f'{len(df.ehr_diagnosis.str.strip().str.lower().unique())} unique values')

In [None]:
df.ehr_diagnosis.str.strip().str.lower().value_counts(dropna=False)

## specific_diagnosis 
This feature is considered by the client to be the most important target variable. It indicates the diagnosis after laboratory testing by the team that provided the dataset. Some categories, such as "idiopathic_anterior" and "idopathic_scleritis" can probably be combined after consultation with the client. Diagnoses that cannot be assigned to a form of uveitis could be designated as "not_uveitis" and grouped together. 

In [None]:
df.specific_diagnosis.str.lower().str.strip().value_counts(dropna=False)

## notes 

Contains notes/comments on the specific diagnosis. This feature has about 60% missing values. At this stage it is a candidate to be dropped. Later it should be checked how the data is related to the feature specific_diagnosis.

In [None]:
# the percent format avoids floating-point artifacts from round(...) * 100
print(f'{df.notes.isna().mean():.0%} missing values')

In [None]:
df.notes.value_counts(dropna=False)

In [None]:
countplot(df, 'notes', f'Distribution of the feature "notes"', 'Notes to the specific diagnosis', f'Note', f'Nr. of Patients')

## AC Abn Od Cells and AC Abn Os Cells
These qualitative, ordinal features describe the severity of the inflammation of the anterior chamber (AC) cells in the right eye (OD) and the left eye (OS), respectively. The inflammation is rated as 0, +0.5, +1, +2, +3 or +4; the higher the value, the more severe the inflammation. If either of these values is greater than 0, the patient can be considered "Active", otherwise "Quiet". This information could be recorded in a new column. Values marked with 'C' (C = Cannot) can be treated as missing, as they indicate that the level of inflammation could not be measured.
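
A minimal sketch of how such an activity column could be derived. It assumes the grades are stored as strings such as "0", "+0.5" or "C"; the real encoding may differ. The column name "ac_activity" and the helper to_grade are hypothetical. An eye that could not be measured is ignored as long as the other eye has a grade, which is a design choice rather than a clinical rule.

```python
import numpy as np
import pandas as pd

# toy data; the string encoding of the grades is an assumption
toy = pd.DataFrame({
    "ac_abn_od_cells": ["0", "+1", "C", "0", np.nan],
    "ac_abn_os_cells": ["0", "0", "+0.5", "C", np.nan],
})

def to_grade(col):
    """Convert grade strings to floats; 'C' and other text become NaN."""
    cleaned = col.str.strip().str.replace("+", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

grades = toy[["ac_abn_od_cells", "ac_abn_os_cells"]].apply(to_grade)

# worst grade per patient; NaN values are skipped by default
worst = grades.max(axis=1)

# "Active" if any grade is above 0, "Quiet" otherwise,
# and missing if neither eye could be graded
activity = worst.gt(0).map({True: "Active", False: "Quiet"})
toy["ac_activity"] = activity.where(worst.notna())

print(toy)
```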

In [None]:
df.ac_abn_od_cells.str.strip().str.lower().value_counts()

In [None]:
df.ac_abn_os_cells.str.strip().str.lower().value_counts()

## Vit Abn Od Cells, Vit Abn Os Cells, Vit Abn Od Haze and Vit Abn Os Haze
Similar to the AC Abn Od/Os Cells features above, these features describe the inflammation of cells in the left (OS) and right (OD) eye, using the same scale of 0, +0.5, +1, +2, +3, +4. If any of the values is greater than 0, the patient is likewise considered "Active". This information can also be recorded in a new column.

In [None]:
a = df.vit_abn_od_cells.str.strip().str.lower().value_counts()
b = df.vit_abn_od_haze.str.strip().str.lower().value_counts()
c = df.vit_abn_os_cells.str.strip().str.lower().value_counts()
d = df.vit_abn_os_haze.str.strip().str.lower().value_counts()
t = pd.concat([a,b,c,d],axis=1)
t = t.sort_index(ascending=True)
t.T.plot(kind='bar', stacked=True);

### hbc__ab, hbs__ag and hcv__ab

These columns encode the lab results for different types of hepatitis. We encode these in binary form: negative results are encoded as '0' and positive results as '1'. In some cases neither a positive nor a negative result can be identified. These values will be set as missing values.
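
The binary encoding could be sketched as below. Only the label "reactive" appears in this notebook; all other labels and the mapping result_map are assumptions and need to be checked against the actual values in the columns.

```python
import numpy as np
import pandas as pd

# toy values; apart from "reactive" and "note:" the labels are assumed
hbs = pd.Series(["Reactive", "non-reactive", "negative", "note:", np.nan, "positive"])

# hypothetical mapping from cleaned label to binary code
result_map = {
    "reactive": 1, "positive": 1,
    "non-reactive": 0, "nonreactive": 0, "negative": 0,
}

# labels missing from the mapping (e.g. "note:") become NaN
hbs_bin = hbs.str.lower().str.strip().map(result_map)
print(hbs_bin.value_counts(dropna=False))
```

Because every unmapped label turns into NaN, new spellings in the raw data show up as an increase in missing values rather than as silent misclassification.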

In [None]:
# barplot hepatitis-columns
countplot(df, 'hbc__ab', f'Distribution of the feature "hbc_ab"', 'Shows if the patient is HepB core positive', f'Test result', f'Nr. of Patients')
df['hbs__ag'] = df['hbs__ag'].str.lower()
df.loc[df['hbs__ag'] == 'see note | positive result s/co ratio is >5.0.  confirmatory testing i', 'hbs__ag'] = 'reactive'
df.loc[df['hbs__ag'] == 'see below | positive result s/co ratio is >5.0.  confirmatory testing ', 'hbs__ag'] = 'reactive'
df.loc[df['hbs__ag'] == 'note:', 'hbs__ag'] = np.nan
countplot(df, 'hbs__ag', f'Distribution of the feature "hbs_ag"', 'Shows if the patient is HepB Surface positive', f'Test result', f'Nr. of Patients')
    
countplot(df, 'hcv__ab', f'Distribution of the feature "hcv_ab"', 'Shows if the patient is HepC positive', f'Test result', f'Nr. of Patients')

In [None]:
type(df)

In [None]:
def plot_range(data, feat, verbose=False):
    data = pipe.extract_num(data, feat, verbose=verbose)
    sns.set(rc={'figure.figsize':(9,6)})
    sns.set_style("ticks")
    
    # extract uom and ranges for feat
    sub_data = data.loc[:,feat:].iloc[:,:3]
    sub_range = sub_data.loc[:,sub_data.columns[sub_data.columns.str.contains(pat = 'range')]]
    # plot boxplot for every "range" and corresponding uom
    g = sns.boxplot(x=sub_range.iloc[:,0],y=sub_data[feat])
    # overlay min and max range
    ranges = [x for x in sub_range.iloc[:,0].unique() if str(x) != 'nan']
    n_range = len(ranges)
    for num, rang in enumerate(ranges):
        if isinstance(rang, str):
            min_range, max_range = rang.split('-')
            min_range, max_range = float(min_range), float(max_range)
            g.axhline(min_range, xmin=num/n_range+.05, xmax=num/n_range+(1/n_range-.05), ls='--')
            g.axhline(max_range, xmin=num/n_range+.05, xmax=num/n_range+(1/n_range-.05), ls='--')
            g.text(num+0.25,max_range+0.05, "max.")
            g.text(num+0.25,min_range+0.05, "min.")
            
    
    # Title, Subtitle and Axis
    g.text(x=0.5, 
            y=1.08, 
            s=f'Ranges of Feature "{feat}"', 
            fontsize=10, weight='bold', ha='center', va='bottom', transform=g.transAxes)
    g.text(x=0.5, 
            y=1.01, 
            s=f'Data has been transformed beforehand, nonnumerical values are excluded in this visualisation\n min. and max. represent the range of values a test can produce', 
            fontsize=10, alpha=0.75, ha='center', va='bottom', transform=g.transAxes)
    plt.xlabel(f'Range in [{sub_data.iloc[1,1]}]')
    plt.ylabel(f'{feat} [{sub_data.iloc[1,1]}]')

    

    plt.show()
    
plot_range(df, "calcium")

In [None]:
plot_range(df, "lactate_dehydrogenase")

### Sources 

Bansal, R., Gupta, V., & Gupta, A. (2010). Current approach in the diagnosis and management of panuveitis. Indian Journal of Ophthalmology, 58(1), 45–54. https://doi.org/10.4103/0301-4738.58471

Alan, P. M., Feldman, B. H., Hung, J., Tsai, J. H., & Hossain, K. (n.d.). Scleritis. EyeWiki. Retrieved March 12, 2021, from https://eyewiki.aao.org/Scleritis

Smith, R. E., Nozik, R. A., & Grabner, G. (1986). Pseudouveitis („Maskerade-Syndrome“). In R. E. Smith, R. A. Nozik, & G. Grabner (Eds.), Uveitis: Klinik, Diagnose, Therapie Ein Leidfaden für die Praxis (pp. 238–241). Springer. https://doi.org/10.1007/978-3-642-70809-1_38
