# Data Analysis

Display information for all numerical features:
- Count - number of values in the dataset
- Mean - average value
- Std deviation - measure of the dispersion of the values
- Min - smallest value
- 25% (Q1) - median of the lower half of the dataset
- 50% (Q2) - median of the dataset
- 75% (Q3) - median of the upper half of the dataset
- Max - largest value
- Mode - most frequently occurring value
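
The quartile definitions above ("median of the lower half", "median of the upper half") can be sketched directly in plain Python. This is only an illustration: the helper names `median` and `quartiles` are hypothetical, and it assumes that for an odd number of values the middle value is left out of both halves. Other conventions, such as linear interpolation between neighbours, give slightly different results.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    # Even length: average the two middle values
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to split the data into halves")
    half = n // 2
    lower = ordered[:half]       # values before the median position
    upper = ordered[n - half:]   # values after the median position
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([7, 1, 3, 8, 2, 6, 4, 5])
print(q1, q2, q3)  # 2.5 4.5 6.5
```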

**Additional information:**
- Interquartile Range (IQR) - Range between Q1 and Q3, measure of dispersion of the middle 50% of the dataset
- Skewness - measure of asymmetry of the distribution of values
    - a value of 0 means that the distribution is approximately symmetric; a positive value indicates a longer right tail, a negative value a longer left tail
$$S = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^3/n}{(\sum_{i=1}^{n}(X_i - \bar{X})^2/n)^{3/2}}$$
- Kurtosis - measure of the 'tailedness' of the distribution
    - the $-3$ term makes this the excess kurtosis, so a normal distribution has a value of 0
$$K = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^4/n}{(\sum_{i=1}^{n}(X_i - \bar{X})^2/n)^2} - 3$$

In these equations:
- $X_i$ represents each individual data point
- $\bar{X}$ is the mean of the dataset
- $n$ is the number of data points
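
As a minimal sketch, the two formulas above can be written in terms of central moments. The function names (`central_moment`, `skewness`, `excess_kurtosis`) are hypothetical, and returning 0 for constant data is an assumption that mirrors the zero-variance guard used in the notebook code.

```python
def central_moment(values, order):
    """Mean of (X_i - mean)^order, dividing by n as in the formulas above."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values):
    """Third central moment divided by the variance to the power 3/2."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0  # assumption: constant data is reported as symmetric
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0  # assumption: same convention as for skewness
    return central_moment(values, 4) / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]

print(skewness(symmetric))         # 0.0, the deviations cancel out
print(skewness(right_tailed) > 0)  # True, the long tail is on the right
print(excess_kurtosis(symmetric))  # about -1.3, flatter than a normal distribution
```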



![Skewness and kurtosis](https://excelrcom.b-cdn.net/assets/admin/ckfinder/userfiles/images/tableau1/tableau2/tableau3/tableau4/tableau5/tableau6/skewness-kurtosis_1JPG-.jpg)


## Describe Program

In [2]:
import math
import pandas as pd
import numpy as np

%run "utils.ipynb"

def get_min(data):
    # Named "minimum" to avoid shadowing the built-in min()
    minimum = data[0]
    for x in data:
        if x < minimum:
            minimum = x
    return minimum

def get_max(data):
    # Named "maximum" to avoid shadowing the built-in max()
    maximum = data[0]
    for x in data:
        if x > maximum:
            maximum = x
    return maximum

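def get_mode(data):
    # Count how often each value occurs
    counts = {}
    for value in data:
        counts[value] = counts.get(value, 0) + 1

    # An empty input has no mode
    if not counts:
        return np.nan

    # Compute the highest count once instead of once per dictionary entry
    max_count = get_max(list(counts.values()))

    # Collect all the values with the maximum count and return the first one encountered
    mode_values = [key for key, value in counts.items() if value == max_count]
    return mode_values[0]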

def get_skewness(data, mean, std, count):
    if std == 0 or count == 0:
        return 0
        
    # Cubing keeps the sign of each deviation and amplifies the impact of extreme values
    scaled_data = [(x - mean)**3 / std**3 for x in data]
    
    # compute the mean of the new array
    return sum(scaled_data) / count

def get_kurtosis(data, mean, std, count):
    if std == 0 or count == 0:
        return 0

    # Raising deviations to the fourth power emphasizes the impact of extreme values (the tails) on kurtosis
    scaled_data = [(x - mean)**4 for x in data]
    scaled_mean = sum(scaled_data) / count
    
    return scaled_mean / std**4 - 3
    
def calculate_statistics(data, name):
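    # Drop NaN values and convert to a plain list, so that positional access
    # such as data[0] still works when the Series index has gaps after filtering
    data = list(data[~np.isnan(data)])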
    
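    count = len(data)
    if count == 0:
        # get_min, get_max and the quartile lookups below all need at least one value
        print(f"{name[:15]:<18}{count:<12}(no numeric values)")
        return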

    mean = 0 if count == 0 else sum(data) / count

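    # Population variance (divides by n, as in the formulas above); pandas' describe() divides by n - 1
    variance = 0 if count == 0 else sum((x - mean) ** 2 for x in data) / count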
    std = math.sqrt(variance)
    
    minimum = get_min(data)
    maximum = get_max(data)

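    sorted_data = sorted(data)
    # Quartiles are read at position int(p * count) without interpolating between
    # neighbours, so they can differ slightly from interpolated values such as pandas' describe()
    q1 = sorted_data[int(0.25 * count)]
    median = sorted_data[int(0.5 * count)]
    q3 = sorted_data[int(0.75 * count)]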

    iqr = q3 - q1

    mode = get_mode(data)

    skewness = get_skewness(data, mean, std, count)
    kurtosis = get_kurtosis(data, mean, std, count)

    print(f"{name[:15]:<18}{count:<12}{mean:<12.3f}{std:<12.3f}{minimum:<12.3f}{q1:<12.3f}{median:<12.3f}{q3:<12.3f}{maximum:<12.3f}{iqr:<12.3f}{mode:<12.3f}{skewness:<12.3f}{kurtosis:<12.3f}")

df = get_data()
print(df)

# filter numeric features
numeric_features = df.select_dtypes(include=['number']).columns

# Display information for numeric features
print(f"\n\033[1m{'Feature':<18}{'Count':<12}{'Mean':<12}{'Std':<12}{'Min':<12}{'25%':<12}{'50%':<12}{'75%':<12}{'Max':<12}{'IQR':<12}{'Mode':<12}{'Skewness':<12}{'Kurtosis':<12}\033[0m")
for column in numeric_features:
    calculate_statistics(df[column], column)
    

      Hogwarts House First Name    Last Name    Birthday Best Hand  \
Index                                                                
0          Ravenclaw     Tamara          Hsu  2000-03-30      Left   
1          Slytherin      Erich      Paredes  1999-10-14     Right   
2          Ravenclaw   Stephany        Braun  1999-11-03      Left   
3         Gryffindor      Vesta    Mcmichael  2000-08-19      Left   
4         Gryffindor     Gaston        Gibbs  1998-09-27      Left   
...              ...        ...          ...         ...       ...   
1595      Gryffindor       Jung        Blank  2001-09-14     Right   
1596       Slytherin     Shelli         Lock  1998-03-12      Left   
1597      Gryffindor   Benjamin  Christensen  1999-10-24     Right   
1598      Hufflepuff  Charlotte       Dillon  2001-09-21      Left   
1599      Hufflepuff      Kylie        Nowak  2000-08-21      Left   

       Arithmancy   Astronomy  Herbology  Defense Against the Dark Arts  \
Index         