## Descriptive Statistics

- mean: np.mean(data)
- median: np.median(data)
- mode: stats.mode(data)
- variance: np.var(data)
- std dev: np.std(data)
- range: np.ptp(data)
- summary (count, min/max, mean, variance, skewness, kurtosis): stats.describe(data)
- 75th percentile: np.percentile(data, 75)
- quartiles: np.quantile(data, [0.25, 0.5, 0.75])
- coefficient of variation: stats.variation(data)
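
The one-liners above can be tried together on a small array. This is a minimal sketch: the sample values are made up for illustration, and the mode comes from `scipy.stats.mode`, since NumPy itself has no mode function. Note that `np.var` and `np.std` default to the population form (`ddof=0`), while `stats.describe` reports the sample variance (`ddof=1`).

```python
import numpy as np
from scipy import stats

# Illustrative sample (hypothetical values, not taken from the notebook's data).
data = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0])

mean = np.mean(data)
median = np.median(data)
# stats.mode returns an array or a scalar depending on the SciPy version,
# so normalise it to a single float here.
mode_value = float(np.atleast_1d(stats.mode(data).mode)[0])
variance = np.var(data)                 # population variance (ddof=0)
sample_variance = np.var(data, ddof=1)  # sample variance
std_dev = np.std(data)                  # population standard deviation
value_range = np.ptp(data)              # max - min
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])
cv = stats.variation(data)              # std_dev / mean, with ddof=0
summary = stats.describe(data)          # nobs, minmax, mean, variance, skewness, kurtosis

print(mean, median, mode_value, value_range)
```
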

## Probability Distribution

- normal distribution: stats.norm.pdf(x, loc=0, scale=1) for the density, or stats.norm.cdf(x, loc=0, scale=1) for the cumulative probability
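
A short sketch of how the density and cumulative functions are typically used for the standard normal. The evaluation points are arbitrary, and `stats.norm.ppf` (the inverse of the CDF) is an addition not listed above.

```python
from scipy import stats

# Density and cumulative probability of the standard normal at x = 0.
pdf_at_zero = stats.norm.pdf(0.0, loc=0, scale=1)  # equals 1 / sqrt(2 * pi)
cdf_at_zero = stats.norm.cdf(0.0, loc=0, scale=1)  # half the mass lies below the mean

# Probability of falling within one standard deviation of the mean.
within_one_sd = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)

# Inverse CDF: the value below which 97.5% of the mass lies.
upper_975 = stats.norm.ppf(0.975)

print(pdf_at_zero, cdf_at_zero, within_one_sd, upper_975)
```
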

## Hypothesis Testing

- paired t-test: stats.ttest_rel(data1, data2)
- one-way ANOVA: stats.f_oneway(data1, data2, data3)
- Kruskal-Wallis H-test: stats.kruskal(data1, data2, data3)
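
The three tests above can be sketched on synthetic data. The group sizes, means, and the seeded random generator below are assumptions chosen only so the example runs; each call returns a test statistic and a p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Paired design: the same 30 subjects measured twice (synthetic values).
before = rng.normal(50, 5, 30)
after = before + rng.normal(2, 1, 30)
t_stat, p_paired = stats.ttest_rel(before, after)

# A paired t-test is a one-sample t-test on the pairwise differences.
t_diff, p_diff = stats.ttest_1samp(before - after, 0.0)

# Three independent groups, the third with a shifted mean.
group_a = rng.normal(10, 2, 25)
group_b = rng.normal(10, 2, 25)
group_c = rng.normal(14, 2, 25)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)   # parametric
h_stat, p_kruskal = stats.kruskal(group_a, group_b, group_c)  # rank-based alternative

print(p_paired, p_anova, p_kruskal)
```

Kruskal-Wallis is usually preferred over ANOVA when the groups are ordinal or clearly non-normal, such as 1-5 rating scales.
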

## Multivariate Analysis

- PCA: from sklearn.decomposition import PCA; PCA().fit_transform(X)
- Canonical correlation analysis: from sklearn.cross_decomposition import CCA; CCA().fit(X, Y).transform(X, Y)
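
To show what a PCA projection computes, here is a NumPy-only sketch: centre the data, take the SVD, and read the component scores and explained variance from it. This is an illustration on synthetic data, not the scikit-learn implementation; its scores should match `PCA().fit_transform(X)` only up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 3 correlated features (hypothetical mixing matrix).
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(100, 3)) @ mixing

# PCA via SVD of the centred data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                      # rows are the principal axes
scores = U * S                       # data projected onto the principal axes
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio)
```
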

## Time Series Analysis

- autocorrelation: pd.Series(data).autocorr(lag=1)
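
A minimal sketch of lag-k autocorrelation using `pandas.Series.autocorr`, which computes the Pearson correlation between a series and a shifted copy of itself. The sine-wave series is synthetic and chosen so the expected behaviour is easy to reason about.

```python
import numpy as np
import pandas as pd

# Synthetic series: a sine wave with a period of 20 samples.
t = np.arange(200)
series = pd.Series(np.sin(2 * np.pi * t / 20))

lag1 = series.autocorr(lag=1)    # neighbouring points move together
lag10 = series.autocorr(lag=10)  # half a period apart: values are mirrored

# The same lag-1 value computed by hand with NumPy.
values = series.to_numpy()
manual_lag1 = np.corrcoef(values[:-1], values[1:])[0, 1]

print(lag1, lag10, manual_lag1)
```
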

## Cluster Analysis

- k-means clustering: from sklearn.cluster import KMeans; KMeans(n_clusters=k).fit(X)
- Hierarchical clustering: from scipy.cluster.hierarchy import linkage; linkage(X, method='ward')
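
A sketch of hierarchical clustering with SciPy, where `linkage` lives in `scipy.cluster.hierarchy`. The two synthetic blobs and the use of `fcluster` to cut the tree into flat labels are assumptions added for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 20 points each.
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

# Ward linkage builds the merge tree; each row of Z records one merge.
Z = linkage(X, method='ward')

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')

print(Z.shape, labels)
```
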


In [1]:
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

file = '/home/aniket-yadav/WORKSPACE/WebProjects/.visenv/Visualizer/staticfiles/Data/ProstateCancer.xlsx'
df = pd.read_excel(file, sheet_name='ProstateCancer')
spcCent = df["Specialized Centers"].to_list()
df.shape[1] # Number of columns in the DataFrame


21

In [61]:

def get_descriptive_statistics(df):
    """
    Calculate descriptive statistics for each numeric column in the DataFrame.

    The first and last columns are skipped; every other column is cast to float.

    Parameters:
    df (pd.DataFrame): DataFrame containing the data.

    Returns:
    pd.DataFrame: One row per statistic and one column per input column,
    rounded to 3 decimal places.
    """
    datavals = {}
    # Positional indexing (iloc) keeps this robust to duplicate or unnamed headers.
    for i in range(1, df.shape[1] - 1):
        column = df.iloc[:, i]
        data = column.astype(float).to_numpy()
        datavals[column.name] = {
            "mean": float(np.mean(data)),
            "median": float(np.median(data)),
            "std": float(np.std(data)),            # population std (ddof=0)
            "variance": float(np.var(data)),       # population variance (ddof=0)
            "min": float(np.min(data)),
            "max": float(np.max(data)),
            "skew": float(stats.skew(data)),
            "kurtosis": float(stats.kurtosis(data)),  # excess kurtosis (normal = 0)
            "coefficient of variation": float(stats.variation(data)),
            "75th percentile": float(np.percentile(data, 75)),
            "50th percentile": float(np.percentile(data, 50)),
            "25th percentile": float(np.percentile(data, 25)),
            "interquartile range": float(stats.iqr(data)),
        }
    # Outer dict keys become columns, inner keys (the statistics) become rows.
    dstat = pd.DataFrame(datavals).round(3)
    return dstat

dstat = get_descriptive_statistics(df)
dstat

Unnamed: 0,Specialized Centers,Genetic & Molecular Testing Infrastructure (1–5),Treatment Access,Research Funding,Awareness Campaigns,Survival Rates,Early Detection,Palliative Care,PSA,TMPRSS2-ERG,PTEN,Unnamed: 4,Clinical Guideline Implementation (1-5),Feasibility of Integration (1-5),Adoption of Int'l Guidelines (1-5),Engagement with Updates (1-3),ESMO Guidelines Implementation (1-5),Reimbursement Framework,No-cost Access
mean,3.262,3.429,2.881,3.071,3.167,3.167,3.048,3.0,55.833,26.905,22.476,35.048,3.571,3.524,3.5,2.071,3.19,2.143,2.095
median,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,55.0,25.0,20.0,33.0,4.0,4.0,3.5,2.0,3.0,2.0,2.0
std,1.415,1.256,1.295,1.28,1.308,1.462,1.308,1.327,22.091,13.361,13.039,15.943,1.256,1.258,1.258,0.703,1.468,0.774,0.75
variance,2.003,1.578,1.676,1.638,1.71,2.139,1.712,1.762,487.996,178.515,170.011,254.188,1.578,1.583,1.583,0.495,2.154,0.599,0.562
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,15.0,5.0,2.0,7.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,95.0,50.0,45.0,60.0,5.0,5.0,5.0,3.0,5.0,3.0,3.0
skew,-0.116,-0.343,0.09,-0.065,-0.118,0.03,-0.024,0.122,-0.133,0.143,0.198,0.024,-0.594,-0.484,-0.43,-0.1,-0.196,-0.56,-0.496
kurtosis,-1.251,-0.816,-1.196,-1.007,-0.978,-1.431,-1.152,-1.144,-1.117,-1.044,-1.159,-1.211,-0.603,-0.698,-0.734,-0.98,-1.358,-0.267,-0.135
coefficient of variation,0.434,0.366,0.449,0.417,0.413,0.462,0.429,0.442,0.396,0.497,0.58,0.455,0.352,0.357,0.36,0.34,0.46,0.361,0.358
75th percentile,5.0,4.75,4.0,4.0,4.0,5.0,4.0,4.0,80.0,38.75,35.0,52.0,5.0,5.0,5.0,3.0,4.75,3.0,3.0


In [58]:
dstat

Unnamed: 0,Specialized Centers,Genetic & Molecular Testing Infrastructure (1–5),Treatment Access,Research Funding,Awareness Campaigns,Survival Rates,Early Detection,Palliative Care,PSA,TMPRSS2-ERG,PTEN,Unnamed: 4,Clinical Guideline Implementation (1-5),Feasibility of Integration (1-5),Adoption of Int'l Guidelines (1-5),Engagement with Updates (1-3),ESMO Guidelines Implementation (1-5),Reimbursement Framework,No-cost Access
mean,3.261905,3.428571,2.880952,3.071429,3.166667,3.166667,3.047619,3.0,55.833333,26.904762,22.47619,35.047619,3.571429,3.52381,3.5,2.071429,3.190476,2.142857,2.095238
median,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,55.0,25.0,20.0,33.0,4.0,4.0,3.5,2.0,3.0,2.0,2.0
std,1.415215,1.256277,1.294722,1.279748,1.307791,1.462494,1.308441,1.327368,22.090632,13.360941,13.03884,15.943281,1.256277,1.25808,1.258306,0.70349,1.467718,0.773718,0.749906
variance,2.002834,1.578231,1.676304,1.637755,1.710317,2.138889,1.712018,1.761905,487.996032,178.514739,170.011338,254.188209,1.578231,1.582766,1.583333,0.494898,2.154195,0.598639,0.562358
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,15.0,5.0,2.0,7.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,95.0,50.0,45.0,60.0,5.0,5.0,5.0,3.0,5.0,3.0,3.0
skew,-0.116325,-0.342615,0.090126,-0.065378,-0.118275,0.0296,-0.024198,0.122168,-0.132826,0.143411,0.197681,0.023976,-0.594061,-0.484249,-0.430224,-0.100488,-0.195724,-0.560206,-0.495966
kurtosis,-1.250857,-0.816327,-1.195922,-1.006842,-0.978128,-1.431366,-1.152307,-1.143901,-1.117288,-1.043965,-1.159221,-1.211235,-0.6033,-0.697544,-0.733676,-0.9797,-1.357895,-0.267304,-0.134723
coefficient of variation,0.433862,0.366414,0.449407,0.416662,0.412987,0.46184,0.429332,0.442456,0.395653,0.496601,0.580118,0.454903,0.351757,0.357023,0.359516,0.339616,0.460031,0.361068,0.357909
75th percentile,5.0,4.75,4.0,4.0,4.0,5.0,4.0,4.0,80.0,38.75,35.0,52.0,5.0,5.0,5.0,3.0,4.75,3.0,3.0


Unnamed: 0,mean,median,std,variance,min,max,skew,kurtosis,coefficient of variation,75th percentile,50th percentile,25th percentile,interquartile range
Specialized Centers,3.261905,3.0,1.415215,2.002834,1.0,5.0,-0.116325,-1.250857,0.433862,5.0,3.0,2.0,3.0
Genetic & Molecular Testing Infrastructure (1–5),3.428571,3.0,1.256277,1.578231,1.0,5.0,-0.342615,-0.816327,0.366414,4.75,3.0,3.0,1.75
Treatment Access,2.880952,3.0,1.294722,1.676304,1.0,5.0,0.090126,-1.195922,0.449407,4.0,3.0,2.0,2.0
Research Funding,3.071429,3.0,1.279748,1.637755,1.0,5.0,-0.065378,-1.006842,0.416662,4.0,3.0,2.0,2.0
Awareness Campaigns,3.166667,3.0,1.307791,1.710317,1.0,5.0,-0.118275,-0.978128,0.412987,4.0,3.0,2.0,2.0
Survival Rates,3.166667,3.0,1.462494,2.138889,1.0,5.0,0.0296,-1.431366,0.46184,5.0,3.0,2.0,3.0
Early Detection,3.047619,3.0,1.308441,1.712018,1.0,5.0,-0.024198,-1.152307,0.429332,4.0,3.0,2.0,2.0
Palliative Care,3.0,3.0,1.327368,1.761905,1.0,5.0,0.122168,-1.143901,0.442456,4.0,3.0,2.0,2.0
PSA,55.833333,55.0,22.090632,487.996032,15.0,95.0,-0.132826,-1.117288,0.395653,80.0,55.0,41.25,38.75
TMPRSS2-ERG,26.904762,25.0,13.360941,178.514739,5.0,50.0,0.143411,-1.043965,0.496601,38.75,25.0,15.0,23.75
