<h1 style='font-size: 35px; color: crimson; font-family: Colonna MT; text-align: center; font-weight: 600'>Descriptive Statistics</h1>

---

*This project focuses on developing Python script functions designed to enhance and streamline the process of exploratory data analysis by computing a comprehensive range of descriptive statistics. Moving beyond the limitations of the standard `describe()` function, these scripts aim to provide a deeper and more flexible understanding of data by capturing both central tendency and dispersion measures. A key feature of this initiative is the ability to generate these statistics across different groups within the dataset, enabling detailed, structured insights with minimal manual effort. The overarching goal is to empower analysts and researchers with tools that make data exploration more intuitive, efficient, and insightful, fostering quicker discovery of patterns and anomalies across diverse datasets.*

**To explore the distribution of continuous variables in our dataset, we examine the following key statistics.**

- The **Mean** gives us the average value.
- **Median** provides the middle value, offering a more robust measure against outliers.
- The **Mode** identifies the most frequent value.
- **Standard Deviation** and **Variance** show how much the data deviates from the mean, with larger values indicating greater spread.
- The **Range** is the difference between the maximum and minimum values.
- **Skewness** measures the asymmetry of the distribution.
- **Kurtosis** describes the "tailedness" of the distribution, which indicates how prone it is to producing outliers.

*By incorporating these metrics, the analysis will better capture aspects like asymmetry, variability, and tail heaviness, helping to more fully understand the underlying data distribution.*
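
As a quick illustration of these measures, the sketch below computes each one on a small made-up sample (the values and the name `sample` are assumptions for illustration, not part of the dataset used later). It also shows why the median is considered more robust: a single extreme value shifts the mean noticeably while the median barely moves.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Hypothetical sample; 40 is a deliberate outlier
sample = pd.Series([2, 3, 3, 4, 5, 40])

stats_summary = {
    "mean": sample.mean(),
    "median": sample.median(),
    "mode": sample.mode().iloc[0],
    "std": sample.std(),            # sample standard deviation (ddof=1)
    "variance": sample.var(),       # sample variance (ddof=1)
    "range": sample.max() - sample.min(),
    "skewness": skew(sample),       # > 0 indicates a longer right tail
    "kurtosis": kurtosis(sample),   # excess kurtosis (normal distribution = 0)
}

# Drop the outlier and compare the two measures of central tendency
without_outlier = sample[sample < 40]

print(stats_summary)
print("mean without outlier:  ", without_outlier.mean())
print("median without outlier:", without_outlier.median())
```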

<h1 style='font-size: 20px; font-family: Colonna MT; font-weight: 600'>1.0: Import Required Libraries</h1>

In [1]:
from scipy.stats import skew, kurtosis 
import scipy.stats as stats  
import pandas as pd  
import numpy as np 
import warnings 

warnings.simplefilter("ignore")  
pd.set_option('display.max_columns', 8) 
pd.set_option('display.float_format', lambda x: '%.2f' % x) 
print("......Libraries Loaded Successfully.........")

......Libraries Loaded Successfully.........


<h1 style='font-size: 20px; font-family: Colonna MT; font-weight: 600'>2.0: Import and Preprocessing Dataset</h1>

In [16]:
def Loading_iris_data():
    from sklearn.datasets import load_iris
    iris = load_iris()
    X, y = iris.data , iris.target 
    
    feature_names, target_names = iris.feature_names, iris.target_names
    df = pd.DataFrame(X, columns=feature_names)
    df['Species'] = y
    df['Species'] = df['Species'].map({i: name for i, name in enumerate(target_names)})
    return df

df = Loading_iris_data()
display(df.head(10))

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


<h1 style='font-size: 20px; font-family: Colonna MT; font-weight: 600'>3.0: Dataset Information / Overview</h1>

In [17]:
df.shape

(150, 5)

In [18]:
for column in df.columns.tolist(): print(f"{'-'*15} {column}")

--------------- sepal length (cm)
--------------- sepal width (cm)
--------------- petal length (cm)
--------------- petal width (cm)
--------------- Species


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   Species            150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>3.2: Columns Summary</h4>

In [20]:
def column_summary(df):
    summary_data = []
    for col_name in df.columns:
        col_dtype = df[col_name].dtype
        num_of_nulls = df[col_name].isnull().sum()
        num_of_non_nulls = df[col_name].notnull().sum()
        num_of_distinct_values = df[col_name].nunique()
        
        if num_of_distinct_values <= 10:
            distinct_values_counts = df[col_name].value_counts().to_dict()
        else:
            top_10_values_counts = df[col_name].value_counts().head(10).to_dict()
            distinct_values_counts = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col_dtype,
            'num_of_nulls': num_of_nulls,
            'num_of_non_nulls': num_of_non_nulls,
            'num_of_distinct_values': num_of_distinct_values,
            'distinct_values_counts': distinct_values_counts
        })
    
    summary_df = pd.DataFrame(summary_data)
    return summary_df

summary_df = column_summary(df)
display(summary_df)

,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values,distinct_values_counts
0,sepal length (cm),float64,0,150,35,"{5.0: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.8:..."
1,sepal width (cm),float64,0,150,23,"{3.0: 26, 2.8: 14, 3.2: 13, 3.4: 12, 3.1: 11, ..."
2,petal length (cm),float64,0,150,43,"{1.4: 13, 1.5: 13, 5.1: 8, 4.5: 8, 1.6: 7, 1.3..."
3,petal width (cm),float64,0,150,22,"{0.2: 29, 1.3: 13, 1.8: 12, 1.5: 12, 1.4: 8, 2..."
4,Species,object,0,150,3,"{'setosa': 50, 'versicolor': 50, 'virginica': 50}"


<h4 style='font-size: 18px; color: blue;  font-family: Colonna MT; font-weight: 600'>3.3: Exploring Invalid Entries by Data Type</h4>

Exploring invalid entries in data types involves identifying values that do not match the expected format or category within each column. This includes detecting inconsistencies such as numerical values in categorical fields, incorrect data formats, or unexpected symbols and typos. Invalid entries can lead to errors in analysis and model performance, making it essential to standardize data types and correct anomalies.

In [21]:
def simplify_dtype(dtype):
    if dtype in (int, float, np.number): return 'Numeric'
    elif np.issubdtype(dtype, np.datetime64): return 'Datetime'
    elif dtype == str: return 'String'
    elif dtype == type(None): return 'Missing'
    else: return 'Other'

def analyze_column_dtypes(df):
    all_dtypes = {'Numeric', 'Datetime', 'String', 'Missing', 'Other'}
    results = pd.DataFrame(index=df.columns, columns=list(all_dtypes), dtype=object).fillna('-')
    
    for column in df.columns:
        dtypes = df[column].apply(lambda x: simplify_dtype(type(x))).value_counts()
        percentages = (dtypes / len(df)) * 100
        for dtype, percent in percentages.items():
            if percent > 0:
                results.at[column, dtype] = f'{percent:.2f}%'  # Add % sign and format to 2 decimal places
            else:
                results.at[column, dtype] = '-'  # Add dash for 0%
    return results

results = analyze_column_dtypes(df)
display(results)

,String,Missing,Datetime,Other,Numeric
sepal length (cm),-,-,-,-,100.00%
sepal width (cm),-,-,-,-,100.00%
petal length (cm),-,-,-,-,100.00%
petal width (cm),-,-,-,-,100.00%
Species,-,-,-,100.00%,-
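
In the table above, `Species` is reported as `Other` rather than `String`. One plausible cause is that the mapped labels are NumPy string scalars (`numpy.str_`), for which the exact comparison `type(x) == str` is false. The sketch below is a hedged variant, not the function used above: the name `simplify_value_type` is hypothetical, and it classifies values with `isinstance` so that NumPy scalar types fall into the expected bucket.

```python
import datetime

import numpy as np
import pandas as pd


def simplify_value_type(value):
    """Classify a single cell value (hypothetical helper for illustration)."""
    if value is None:
        return 'Missing'
    if isinstance(value, (float, np.floating)) and np.isnan(value):
        return 'Missing'
    if isinstance(value, (bool, np.bool_)):
        return 'Other'           # keep booleans out of the numeric bucket
    if isinstance(value, (int, float, np.number)):
        return 'Numeric'
    if isinstance(value, (datetime.datetime, np.datetime64)):
        return 'Datetime'        # pd.Timestamp subclasses datetime.datetime
    if isinstance(value, str):   # np.str_ is a subclass of str
        return 'String'
    return 'Other'


# Made-up mixed column to exercise every branch
mixed = [1, 2.5, np.str_('setosa'), 'abc', None, pd.Timestamp('2024-01-01')]
print([simplify_value_type(v) for v in mixed])
```

Swapping this logic into `analyze_column_dtypes` would change the reported percentages, so the output shown above should be regenerated if it is adopted.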


<h1 style='font-size: 25px; font-family: Colonna MT; font-weight: 600'>4.0: Statistical Description of the Dataset</h1>

Now, let's examine the descriptive statistics of the data using Pandas’ built-in `describe()` method. It gives a quick summary of the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median) for each numeric column, providing a bird's-eye view of the dataset’s distribution before the more detailed custom analyses that follow.

In [14]:
summary_statistics = df.describe().reset_index()
summary_statistics

,index,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,count,150.0,150.0,150.0,150.0
1,mean,5.84,3.06,3.76,1.2
2,std,0.83,0.44,1.77,0.76
3,min,4.3,2.0,1.0,0.1
4,25%,5.1,2.8,1.6,0.3
5,50%,5.8,3.0,4.35,1.3
6,75%,6.4,3.3,5.1,1.8
7,max,7.9,4.4,6.9,2.5
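
`describe()` also accepts a few options that can be useful before reaching for custom functions. The sketch below uses a small made-up frame (the names `demo`, `height`, and `label` are assumptions for illustration) to request extra percentiles and to include non-numeric columns in the summary.

```python
import pandas as pd

# Made-up data: one numeric and one categorical column
demo = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": ["a", "b", "a", "a", "b"],
})

# Extra percentiles; the median (50%) is always included
numeric_summary = demo.describe(percentiles=[0.05, 0.95])

# include="all" adds count/unique/top/freq rows for non-numeric columns
full_summary = demo.describe(include="all")

print(numeric_summary)
print(full_summary)
```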


<h4 style='font-size: 18px; color: blue; font-family: colonna mt; font-weight: 600'>4.2:  Distribution of Continuous variables</h4>

To build on the basic descriptive statistics above, this section adds further measures: the mode, variance, range, skewness, and kurtosis. These describe the shape and spread of each continuous variable, capturing asymmetry, variability, and tail heaviness. The standard error of the mean (SEM) is added in the group-wise section that follows. The script below implements these calculations to enrich the exploratory process.

In [15]:
def distributions_statistics(df):
    results = []
    for col in df.select_dtypes(include=[np.number]).columns:
        mean = df[col].mean()
        median = df[col].median()
        mode = df[col].mode().iloc[0] if not df[col].mode().empty else np.nan
        std_dev = df[col].std()
        variance = df[col].var()
        value_range = df[col].max() - df[col].min()
        skewness_val = skew(df[col], nan_policy='omit')  # Skewness (SciPy default: biased estimator)
        kurtosis_val = kurtosis(df[col], nan_policy='omit')  # Excess (Fisher) kurtosis; 0 for a normal distribution

        results.append({
            'Parameter': col,
            'Mean': mean,
            'Median': median,
            'Mode': mode,
            'Standard Deviation': std_dev,
            'Variance': variance,
            'Range': value_range,
            'Skewness': skewness_val,
            'Kurtosis': kurtosis_val
        })

    results = pd.DataFrame(results)
    return results

pd.set_option('display.max_columns', 10) 
results = distributions_statistics(df)
display(results)

,Parameter,Mean,Median,Mode,Standard Deviation,Variance,Range,Skewness,Kurtosis
0,sepal length (cm),5.84,5.8,5.0,0.83,0.69,3.6,0.31,-0.57
1,sepal width (cm),3.06,3.0,3.0,0.44,0.19,2.4,0.32,0.18
2,petal length (cm),3.76,4.35,1.4,1.77,3.12,5.9,-0.27,-1.4
3,petal width (cm),1.2,1.3,0.2,0.76,0.58,2.4,-0.1,-1.34
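
One detail worth keeping in mind when reading the skewness and kurtosis columns: `scipy.stats.skew` and `scipy.stats.kurtosis` default to `bias=True`, whereas pandas' `Series.skew()` and `Series.kurt()` apply a small-sample bias correction, so the two libraries can report slightly different values for the same column. The sketch below, on made-up data, illustrates the difference; passing `bias=False` to SciPy should reproduce the pandas figures.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

values = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 11.0])  # made-up sample

biased_skew = skew(values)                    # SciPy default: bias=True
unbiased_skew = skew(values, bias=False)      # bias-corrected estimator
biased_kurt = kurtosis(values)                # excess (Fisher) kurtosis, bias=True
unbiased_kurt = kurtosis(values, bias=False)  # bias-corrected excess kurtosis

print(f"skewness: scipy={biased_skew:.4f}  scipy(bias=False)={unbiased_skew:.4f}  pandas={values.skew():.4f}")
print(f"kurtosis: scipy={biased_kurt:.4f}  scipy(bias=False)={unbiased_kurt:.4f}  pandas={values.kurt():.4f}")
```

With 150 rows the gap is small, but it grows for the smaller per-group samples used later.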


<h4 style='font-size: 18px; color: blue; font-family: colonna mt; font-weight: 600'>4.3:  Group-wise Distribution of Continuous variables</h4>

Additionally, there are cases where it's important to analyze how statistical measures of variables differ across various groups within the dataset. Grouping statistics in this way allows for a clearer understanding of how variables behave under different categories or conditions—such as across regions, time periods, or treatment types. To support this need, a modified logic is implemented below, enabling the computation of descriptive statistics within each group. This approach enhances the depth and clarity of the data exploration process.

In [8]:
def group_distributions_statistics(df, group_column):
    results = []
    grouped = df.groupby(group_column)
    for col in df.select_dtypes(include=[np.number]).columns:
        if col != group_column:
            for group_name, group_data in grouped:
                mean = group_data[col].mean()
                median = group_data[col].median()
                mode = group_data[col].mode().iloc[0] if not group_data[col].mode().empty else np.nan
                std_dev = group_data[col].std()
                variance = group_data[col].var()
                cv = std_dev / mean * 100  # Coefficient of Variation (%)
                value_range = group_data[col].max() - group_data[col].min()

                # SciPy defaults: biased estimators; kurtosis is excess (Fisher) kurtosis
                skewness_val = skew(group_data[col], nan_policy='omit')  # Skewness
                kurtosis_val = kurtosis(group_data[col], nan_policy='omit')  # Kurtosis

                n = group_data[col].count()  # Sample size: non-null values only, consistent with std()
                sem = std_dev / np.sqrt(n) if n > 1 else np.nan  # Standard error of the mean

                # Append the results to the list
                results.append({
                    group_column: group_name,
                    'Variables': col,
                    'Mean': mean,
                    'SEM': sem,
                    'Median': median,
                    'Mode': mode,
                    'Standard Deviation': std_dev,
                    'Variance': variance,
                    'Coefficient of Variation': cv,
                    'Range': value_range,
                    'Skewness': skewness_val,
                    'Kurtosis': kurtosis_val,
                })

    result_df = pd.DataFrame(results)
    return result_df

group_column = 'Species'
pd.set_option('display.max_columns', 12) 
results = group_distributions_statistics(df, group_column)
display(results)

,Species,Variables,Mean,SEM,Median,Mode,Standard Deviation,Variance,Coefficient of Variation,Range,Skewness,Kurtosis
0,setosa,sepal length (cm),5.01,0.05,5.0,5.0,0.35,0.12,7.04,1.5,0.12,-0.35
1,versicolor,sepal length (cm),5.94,0.07,5.9,5.5,0.52,0.27,8.7,2.1,0.1,-0.6
2,virginica,sepal length (cm),6.59,0.09,6.5,6.3,0.64,0.4,9.65,3.0,0.11,-0.09
3,setosa,sepal width (cm),3.43,0.05,3.4,3.4,0.38,0.14,11.06,2.1,0.04,0.74
4,versicolor,sepal width (cm),2.77,0.04,2.8,3.0,0.31,0.1,11.33,1.4,-0.35,-0.45
5,virginica,sepal width (cm),2.97,0.05,3.0,3.0,0.32,0.1,10.84,1.6,0.35,0.52
6,setosa,petal length (cm),1.46,0.02,1.5,1.4,0.17,0.03,11.88,0.9,0.1,0.8
7,versicolor,petal length (cm),4.26,0.07,4.35,4.5,0.47,0.22,11.03,2.1,-0.59,-0.07
8,virginica,petal length (cm),5.55,0.08,5.55,5.1,0.55,0.3,9.94,2.4,0.53,-0.26
9,setosa,petal width (cm),0.25,0.01,0.2,0.2,0.11,0.01,42.84,0.5,1.22,1.43
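
For comparison, much of the same table can be produced with pandas' built-in `groupby(...).agg(...)`. This is a sketch on a small made-up frame (the names `toy`, `group`, and `value` are assumptions), not a replacement for the function above; it also checks that the SEM equals the standard deviation divided by the square root of the group size.

```python
import numpy as np
import pandas as pd

# Made-up data with two groups of four observations each
toy = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 14.0, 20.0],
})

# Built-in aggregations, addressed by name
grouped_stats = toy.groupby("group")["value"].agg(
    ["count", "mean", "sem", "median", "std", "var", "skew"]
)

# Derived columns: coefficient of variation and a manual SEM for comparison
grouped_stats["cv_percent"] = grouped_stats["std"] / grouped_stats["mean"] * 100
grouped_stats["sem_manual"] = grouped_stats["std"] / np.sqrt(grouped_stats["count"])

print(grouped_stats)
```

The explicit loop in the function above is still convenient when measures without a pandas built-in (such as the mode or SciPy's kurtosis) need to sit in the same row.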


<h4 style='font-size: 18px; color: Blue; font-family: colonna mt; font-weight: 600'>4.4: Comparative Analysis</h4>

Now, let’s turn to comparing the means of variables across specified groups. This shows how each variable behaves, on average, within each category; for instance, how the average of a measurement changes across species. Such comparisons help flag differences between groups that may merit formal testing, and they reveal patterns or trends for deeper analysis. The table below reports each group's mean ± SEM, followed by the grand mean, overall SEM, and coefficient of variation (%CV) for every variable.

In [25]:
def summary_stats(df, group):
    # Numeric columns to summarise; the grouping column is excluded even if it is numeric
    Metrics = [c for c in df.select_dtypes(include=np.number).columns if c != group]
    grand_mean = df[Metrics].mean()  # Overall mean of each variable
    sem = df[Metrics].sem()  # Overall standard error of the mean
    cv = df[Metrics].std() / df[Metrics].mean() * 100  # Overall coefficient of variation (%)
    grouped = df.groupby(group)[Metrics].agg(['mean', 'sem']).reset_index()
    
    summary_df = pd.DataFrame()
    for col in Metrics:
        summary_df[col] = grouped.apply(
            lambda x: f"{x[(col, 'mean')]:.2f} ± {x[(col, 'sem')]:.2f}", axis=1
        )
    
    summary_df.insert(0, group, grouped[group])
    grand_mean_row = ['Grand Mean'] + grand_mean.tolist()
    sem_row = ['SEM'] + sem.tolist()
    cv_row = ['%CV'] + cv.tolist()
    
    summary_df.loc[len(summary_df)] = grand_mean_row
    summary_df.loc[len(summary_df)] = sem_row
    summary_df.loc[len(summary_df)] = cv_row
    
    return summary_df


results = summary_stats(df, group='Species')
display(results)

,Species,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,setosa,5.01 ± 0.05,3.43 ± 0.05,1.46 ± 0.02,0.25 ± 0.01
1,versicolor,5.94 ± 0.07,2.77 ± 0.04,4.26 ± 0.07,1.33 ± 0.03
2,virginica,6.59 ± 0.09,2.97 ± 0.05,5.55 ± 0.08,2.03 ± 0.04
3,Grand Mean,5.84,3.06,3.76,1.20
4,SEM,0.07,0.04,0.14,0.06
5,%CV,14.17,14.26,46.97,63.56
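
The table reports means and standard errors but does not by itself test whether the group means differ. As one possible follow-up (an addition here, not part of the analysis above), a one-way ANOVA via `scipy.stats.f_oneway` can be run for each variable. The sketch uses made-up data, and `compare_group_means` is a hypothetical helper name.

```python
import pandas as pd
from scipy.stats import f_oneway


def compare_group_means(df, group, column):
    """One-way ANOVA of `column` across the levels of `group` (illustrative helper)."""
    samples = [g[column].dropna().to_numpy() for _, g in df.groupby(group)]
    return f_oneway(*samples)


# Made-up data: three groups with clearly separated means
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
    "value": [1.0, 1.2, 0.9, 1.1, 1.0,
              2.0, 2.1, 1.9, 2.2, 2.0,
              3.0, 3.1, 2.9, 3.2, 3.0],
})

result = compare_group_means(toy, "group", "value")
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

A small p-value suggests that at least one group mean differs from the others. ANOVA assumes roughly normal residuals and similar variances across groups, so those assumptions are worth checking before relying on the result.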


---

This analysis was performed by **Jabulente**, a passionate and dedicated data scientist with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.  

    
<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>


<h5 style='font-size: 18px; color: crimson; font-family: Colonna MT; font-weight: 600; text-align: center'>THE END</h5>