# 📊 Statistics

## Exploratory Data Analysis (Descriptive Statistics)
* Descriptive statistics are about **summarizing**, **organizing**, and **simplifying** data

In [8]:
import pandas as pd
import numpy as np
import scipy.stats as sts

In [54]:
# Create the data
data = {
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California", 
              "Colorado", "Connecticut", "Delaware"],
    "Population": [4779736, 710231, 6392017, 2915918, 37253956, 
                   5029196, 3574097, 897934],
    "Murder rate": [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8],
    "Abbreviation": ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE"]
}

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df

| | State | Population | Murder rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |
| 5 | Colorado | 5029196 | 2.8 | CO |
| 6 | Connecticut | 3574097 | 2.4 | CT |
| 7 | Delaware | 897934 | 5.8 | DE |


### Compute the mean, trimmed mean, and median for the population

In [55]:
# Mean
mean = df['Population'].mean()
print(mean)

7694135.625


In [56]:
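# Trimmed mean
# Note: with 8 rows, a 10% cut removes int(0.10 * 8) = 0 values from each end,
# so the result equals the plain mean; proportiontocut >= 0.125 drops one value per end.
trimmed_mean = sts.trim_mean(df['Population'], proportiontocut=0.10)
print(trimmed_mean)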

7694135.625
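
With eight observations, a 10% cut rounds down to zero values per tail, which is why the trimmed mean above matches the plain mean. The plain-Python sketch below reproduces the trimming logic by hand and shows the effect of a 12.5% cut, which drops the smallest and largest states. The helper name `trimmed_mean_manual` is hypothetical, and truncating the cut count to an integer is an assumption meant to mirror how `trim_mean` behaves.

```python
def trimmed_mean_manual(values, proportion_to_cut):
    """Mean after dropping a fraction of the sorted values from each end."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(proportion_to_cut * n)  # values dropped per tail, truncated toward zero
    if 2 * k >= n:
        raise ValueError("proportion_to_cut removes every value")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Same population figures as the DataFrame above
populations = [4779736, 710231, 6392017, 2915918, 37253956,
               5029196, 3574097, 897934]

no_trim = trimmed_mean_manual(populations, 0.10)    # k = 0, identical to the mean
one_each = trimmed_mean_manual(populations, 0.125)  # k = 1, drops Alaska and California
print(no_trim)
print(one_each)
```

Once California is excluded, the trimmed mean falls below the median reported for the full data, which illustrates how strongly a single large value pulls the ordinary mean upward.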


In [57]:
# Median
median = df['Population'].median()
print(median)

4176916.5


### Weighted Mean: Applicable Cases
* Data points are not equally significant
* Groups have different sizes
* Items have different importance
* The data are pre-grouped (aggregated)
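
For reference, the weighted mean of values $x_i$ with weights $w_i$ can be written as follows. The symbols are introduced here for illustration; in the murder-rate example, $x_i$ is a state's murder rate and $w_i$ is its population.

$$
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$

When all weights are equal, this reduces to the ordinary mean.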

In [58]:
df

| | State | Population | Murder rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |
| 5 | Colorado | 5029196 | 2.8 | CO |
| 6 | Connecticut | 3574097 | 2.4 | CT |
| 7 | Delaware | 897934 | 5.8 | DE |


In [59]:
# Basic Implementation

numerator = (df['Murder rate'] * df['Population']).sum()
denominator = df['Population'].sum()

weighted_mean = numerator / denominator
print(f'weighted_mean: {weighted_mean:.2f}')

weighted_mean: 4.38


In [60]:
# Using the np.average() function
weighted_mean = np.average(df['Murder rate'], weights=df['Population'])
print(f'weighted_mean: {weighted_mean:.2f}')

weighted_mean: 4.38
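
To see why the weighting matters here, the sketch below compares the unweighted mean of the eight murder rates with the population-weighted mean, using plain Python and the same figures as the table. The variable names are illustrative.

```python
# Murder rates and populations in the same row order as the DataFrame
rates = [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8]
populations = [4779736, 710231, 6392017, 2915918, 37253956,
               5029196, 3574097, 897934]

# Unweighted: every state counts equally
simple_mean = sum(rates) / len(rates)

# Weighted: each state's rate counts in proportion to its population
weighted_mean = sum(r * p for r, p in zip(rates, populations)) / sum(populations)

print(f"simple mean:   {simple_mean:.3f}")
print(f"weighted mean: {weighted_mean:.3f}")
```

California holds more than half of the total population in this sample, so its rate of 4.4 dominates the weighted figure and pulls it below the simple mean.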


### Weighted Median
**Steps:**  
* Sort the data in ascending order of value, keeping each weight attached to its value
* Calculate the cumulative weight $S_i$ for each sorted data point
* Find the total weight $W_{\text{total}}$
* The weighted median is the first value $x_k$ whose cumulative weight satisfies $S_k \geq W_{\text{total}} / 2$

In [65]:
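# Sort by the value of interest so cumulative weights follow ascending murder rate
df_sorted = df.sort_values(by='Murder rate')
cumulative_weight = df_sorted['Population'].cumsum()

total_weight = df_sorted['Population'].sum()
half_of_total_weight = total_weight / 2

# First row whose cumulative population reaches half of the total
weighted_median = df_sorted[cumulative_weight >= half_of_total_weight].iloc[0]
# Double-quoted f-strings keep the inner single quotes valid before Python 3.12
print(f"Weighted Median: {weighted_median['Murder rate']}")
print(f"Simple Median: {df_sorted['Murder rate'].median()}")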

Weighted Median: 4.4
Simple Median: 5.15
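
The same steps can be wrapped in a small reusable function. This is a minimal plain-Python sketch, not a library API: the name `weighted_median` and its error handling are assumptions made for illustration, and it assumes non-negative weights.

```python
def weighted_median(values, weights):
    """First value, in ascending order, whose cumulative weight reaches half the total."""
    if len(values) != len(weights) or len(values) == 0:
        raise ValueError("values and weights must be non-empty and the same length")
    pairs = sorted(zip(values, weights))  # sort by value, keeping weights attached
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    # Only reachable if weights are negative, which this sketch does not support
    raise ValueError("weights must be non-negative")


rates = [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8]
populations = [4779736, 710231, 6392017, 2915918, 37253956,
               5029196, 3574097, 897934]

by_population = weighted_median(rates, populations)
equal_weights = weighted_median(rates, [1] * len(rates))
print(by_population)
print(equal_weights)
```

With equal weights and an even number of values, this rule returns the lower of the two middle values rather than their average, so it will not match `.median()` in that case.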
