# Assignment 1: Unlocking Statistical Insights with Python
---

Welcome to Assignment 1! In this assignment, you'll dive into the world of statistical analysis using Python.  We'll move beyond theoretical concepts and focus on practical application. You'll learn how to calculate key statistical measures and use them for real-world tasks like outlier detection.

This assignment will guide you through:

* **Revising Concepts and Equations:** You'll implement code from scratch to compute essential statistics: mean, median, mode, variance, standard deviation, range, and interquartile range (IQR).
* **Calculating Basic Statistics with Libraries:** You'll then compute the same statistics using libraries such as NumPy and SciPy.
* **Outlier Detection:** You'll apply the Interquartile Range (IQR) method to identify and filter outliers from datasets, a crucial step in data cleaning and analysis.

By the end of this assignment, you'll have a strong foundation in performing statistical calculations in Python and applying these techniques to understand and refine data.

#### General Instructions 

Please adhere to the following guidelines:

- **Code Clarity:** Your code should be well-formatted, easy to understand, and include meaningful variable names.
- **Docstrings:**  Use docstrings to document your functions and explain their purpose, arguments, and return values.
- **Testing:** Run your code on the provided example data to demonstrate that it works.
- **NOTE:** Write your answers in this notebook, alongside the given examples.


---

## Q1 - Basic Statistics Calculator

**Objective:** Create a Python program that calculates basic statistics (mean, median, mode, range, variance, and standard deviation) for the given set of numbers.

**Requirements:**
Follow the TODOs below to complete each statistical calculation.


In [1]:
# Sample temperature readings used throughout Q1
temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21,19, 24, 0, 
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 
                22, 26, 23, 21, 23, 17, 20, 18]

In [9]:
import numpy as np

Q1 = np.percentile(temperatures, 25)
Q3 = np.percentile(temperatures, 75)
print(Q1, Q3, np.median(temperatures))

20.0 23.0 21.0
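
The quartiles above rely on `np.percentile`, which by default interpolates linearly between the two closest sorted values. As a rough illustration of that rule, here is a minimal pure-Python sketch. The name `percentile_linear` is a hypothetical helper introduced for this example and is not part of NumPy.

```python
def percentile_linear(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Fractional position of the percentile in the sorted list
    pos = (len(ordered) - 1) * p / 100
    lo = int(pos)                       # index just below the position
    hi = min(lo + 1, len(ordered) - 1)  # index just above, clamped to the end
    frac = pos - lo                     # how far we are between lo and hi
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
                22, 26, 23, 21, 23, 17, 20, 18]

print(percentile_linear(temperatures, 25),
      percentile_linear(temperatures, 75),
      percentile_linear(temperatures, 50))
# 20.0 23.0 21.0
```

With 33 values the 25th, 50th, and 75th percentiles land exactly on sorted indices 8, 16, and 24, so no interpolation between neighbours is needed for this dataset.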


In [4]:
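def b_sort(values):
    """Return a new list with the values sorted in ascending order."""
    values = list(values)  # Copy so the caller's list is not modified
    length = len(values)
    for i in range(length - 1):
        # Swap any smaller element found later in the list into position i
        for j in range(i + 1, length):
            if values[j] < values[i]:
                values[i], values[j] = values[j], values[i]
    return values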

In [39]:
# --- Mean Calculation ---
# TODO: Calculate the mean of the numbers in the 'temperatures' list.
#       Handle the case where the list is empty (return None or print a message).
#       Print the mean using an f-string.
# Your code here:
def mean(values):
    if values:
        m = round(sum(values) / len(values), 3)
        return m
    else:
        return None

In [4]:
print(f"Mean = {mean(temperatures)}")

Mean = 21.879


In [23]:
# --- Median Calculation ---
# TODO: Calculate the median of the numbers in the 'temperatures' list.
#       Sort the 'temperatures' list first.
#       Handle both even and odd length lists.
#       Print the median using an f-string.
# Your code here:
def median(values):
    if values:
        n = len(values)
        sorted_values = b_sort(values)
        mid_index = n // 2

        if n % 2 != 0:
            return sorted_values[mid_index]
        else:
            
            return (sorted_values[mid_index] + sorted_values[mid_index - 1]) / 2
    else:
        return None

In [24]:
print(f"Median = {median(temperatures)}")

Median = 21


In [13]:
# --- Mode Calculation (Simple) ---
# TODO: Calculate the mode (most frequent number) in the 'temperatures' list.
#       If there are multiple modes, you can return any one of them.
#       If there is no mode (all numbers appear once), you can return None or print a message.
#       Print the mode using an f-string.
# Your code here:
def mode(values):
    if values:
        freq = {}
        for v in values:
            if v in freq:
                freq[v] += 1
            else:
                freq[v] = 1

        max_freq = max(freq.values())
        modes = [m for m in set(values) if freq[m] == max_freq]

        if len(modes) == len(values):
            return None
        else:
            return modes[0]
    else:
        return None
    

In [14]:
print(f"Mode = {mode(temperatures)}")

Mode = 20
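
As a sanity check on the hand-written `mode` function, the result can be compared with the standard library. The sketch below is only an illustration using `collections.Counter` and `statistics.multimode`; it is not a required part of the answer.

```python
from collections import Counter
import statistics

temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
                22, 26, 23, 21, 23, 17, 20, 18]

# Count how often each value occurs and take the most frequent one
counts = Counter(temperatures)
value, frequency = counts.most_common(1)[0]
print(f"Most common value = {value} (appears {frequency} times)")
# Most common value = 20 (appears 5 times)

# multimode returns every value tied for the highest frequency
all_modes = statistics.multimode(temperatures)
print(f"All modes = {all_modes}")
# All modes = [20]
```

When several values tie, `statistics.multimode` returns all of them in order of first appearance, which makes the "multiple modes" case from the TODO easy to inspect.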


In [43]:
# --- Variance Calculation ---
# TODO: Calculate the variance of the numbers in the 'temperatures' list.
#       Use the formula for sample variance (divide by n-1 for sample, n for population).
#       For this exercise, assume it's sample variance if list has more than 1 element, else return 0 if list has 0 or 1 element.
#       Print the variance using an f-string.
# Your code here:
def var(values):
    n = len(values) 
    if n > 1:
        m = mean(values)
        v = round(sum([(x - m) ** 2 for x in values]) / (n - 1), 3)
        return v
    else:
        return 0

In [16]:
print(f"Variance = {var(temperatures)}")

Variance = 101.36
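
The TODO above distinguishes sample variance (divide by n-1) from population variance (divide by n). A quick way to see the difference, and to cross-check the hand-written `var`, is the standard library `statistics` module. This is a sketch for comparison only.

```python
import statistics

temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
                22, 26, 23, 21, 23, 17, 20, 18]

# Sample variance: sum of squared deviations divided by (n - 1)
sample_var = statistics.variance(temperatures)

# Population variance: the same sum divided by n
population_var = statistics.pvariance(temperatures)

print(f"Sample variance     = {round(sample_var, 3)}")
print(f"Population variance = {round(population_var, 3)}")
```

The sample variance should agree with the 101.36 printed above, and the population variance is always slightly smaller because it divides by n rather than n-1.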


In [40]:
# --- Standard Deviation Calculation ---
# TODO: Calculate the standard deviation of the numbers in the 'temperatures' list.
#       Take the square root of the variance calculated above.
#       Print the standard deviation using an f-string.
# Your code here:
from math import sqrt
def std(values):
    v = var(values)
    
    return round(sqrt(v), 3)


In [18]:
print(f"Standard Deviation = {std(temperatures)}")

Standard Deviation = 10.068


In [2]:
# --- Range Calculation ---
# TODO: Calculate the range (difference between max and min) of the 'temperatures' list.
#       Print the range using an f-string.
# Your code here:
def Range(values):
    return max(values) - min(values)

In [3]:
print(f"Range = {Range(temperatures)}")

Range = 60


In [33]:
# --- IQR Calculation ---
# TODO: Calculate the Interquartile Range (IQR) of the 'temperatures' list.
#       Use the percentile function you defined in Q1 or define a new one if needed.
#       Calculate Q1 (25th percentile) and Q3 (75th percentile).
#       IQR = Q3 - Q1
#       Print the IQR using an f-string.
# Your code here:
def iqr(values):
    sorted_values = b_sort(values)
    n = len(values)
    if n % 2 == 0:
        lower_half = sorted_values[:n // 2]
        upper_half = sorted_values[n // 2:]
    else:
        lower_half = sorted_values[:n // 2]
        upper_half = sorted_values[n // 2+1:]
    # Keep the quartiles as computed; int() would truncate values such as 19.5
    Q1 = median(lower_half)
    Q3 = median(upper_half)
    return Q3 - Q1


In [35]:
print(f"IQR = {iqr(temperatures)}")

IQR = 4.0


In [None]:
# --- Skewness Calculation ---
# TODO: (Optional) Calculate the skewness of the 'temperatures' list.
#       You can use a simple method for skewness calculation.
#       Print the skewness using an f-string.
# Your code here:
def skewness(values):
    if values:
        n = len(values)
        m = mean(values)
        sd = std(values)
        s = round(sum([(x - m) ** 3 for x in values]) / ((n - 1) * sd ** 3), 3)
        return s
    else:
        return None

In [47]:
print(f"Skewness = {skewness(temperatures)}")

Skewness = 0.873
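
For reference, the `skewness` function above evaluates the following expression, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation returned by `std`, and $n$ is the number of values:

$$
g = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3},
\qquad
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

A positive value, such as the 0.873 printed above, indicates a longer right tail. Other definitions exist (for example, `scipy.stats.skew` uses moments divided by $n$ by default), so library results may differ slightly from this one.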


In [50]:
# --- Kurtosis Calculation ---
# TODO: Calculate the kurtosis of the 'temperatures' list.
#       You can use a simple method for kurtosis calculation.
#       Print the kurtosis using an f-string.
# Your code here:
def kurtosis(values):
    if values:
        n = len(values)
        m = mean(values)
        sd = std(values)
        s = round(sum([(x - m) ** 4 for x in values]) / ((n - 1) * sd ** 4), 3)
        return s
    else:
        return None

In [51]:
print(f"Kurtosis = {kurtosis(temperatures)}")

Kurtosis = 7.852
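
The `kurtosis` function above uses the fourth-power analogue of the skewness expression, with the same sample mean $\bar{x}$ and sample standard deviation $s$:

$$
k = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)\, s^4}
$$

This is the non-excess form, for which a normal distribution gives a value close to 3. Note that `scipy.stats.kurtosis` reports excess kurtosis by default (`fisher=True`, which subtracts 3) and uses moments divided by $n$, so its output should not be expected to match 7.852 directly.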


---
## Q2 - Repeat Q1 with NumPy
**Objective:** Use NumPy to calculate basic statistics (mean, median, mode, range, variance, and standard deviation) for a given set of numbers.

---
## Q2 - Filtering Outliers


The basic approach is to identify potential outliers based on a defined threshold. A common method is to use the **Interquartile Range (IQR)**. Here's how it works:

1. Use Python libraries (don't hard-code the equations) to calculate basic statistics (mean, median, mode, range, variance, standard deviation, and IQR) for a given set of numbers.

2. **Define Outlier Boundaries:**
   - Lower Boundary: Q1 - 1.5 * IQR
   - Upper Boundary: Q3 + 1.5 * IQR

3. **Filter the List:**
   - Remove any values that fall outside the calculated boundaries.
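
The steps above can be sketched on a small made-up list. The data and the variable names (`readings`, `iqr_value`, `filtered`) are hypothetical and chosen only for illustration; the sketch uses the standard library `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points in a way similar to the default of `np.percentile`.

```python
import statistics

# Hypothetical toy data (not the assignment's dataset)
readings = [10, 12, 11, 13, 12, 95, 11, 10, 12, -40]

# Step 1: quartiles from a library function instead of hand-written formulas
q1, q2, q3 = statistics.quantiles(readings, n=4, method="inclusive")
iqr_value = q3 - q1

# Step 2: outlier boundaries
lower_bound = q1 - 1.5 * iqr_value
upper_bound = q3 + 1.5 * iqr_value

# Step 3: keep only the values inside the boundaries
filtered = [x for x in readings if lower_bound <= x <= upper_bound]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr_value)
print("Bounds:", lower_bound, upper_bound)
print("Filtered:", filtered)
# Q1: 10.25 Q3: 12.0 IQR: 1.75
# Bounds: 7.625 14.625
# Filtered: [10, 12, 11, 13, 12, 11, 10, 12]
```

Both extreme readings (95 and -40) fall outside the boundaries and are dropped, while the original order of the remaining values is preserved.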

In [7]:
import numpy as np
from scipy import stats # Import scipy.stats for mode and skewness if needed

# Given set of numbers
temperatures = np.array([25, 32, 45, 18, 60, 55, 48, 72, 23, 25, 20, 23, -5, 21, 18, 19, 24, 21,19, 24, 0,
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
                22, 26, 23, 21, 23, 17, 20, 18])

In [None]:
# --- Mean Calculation with NumPy ---
# TODO: Calculate the mean of the temperatures array using numpy.
# Your code here:


# --- Median Calculation with NumPy ---
# TODO: Calculate the median of the temperatures array using numpy.
# Your code here:


# --- Mode Calculation with NumPy/SciPy ---
# TODO: Calculate the mode (most frequent number) using numpy or scipy.stats.
#       NumPy doesn't have a direct mode function. 
# Your code here:


# --- Variance Calculation with NumPy ---
# TODO: Calculate the variance of the temperatures array using numpy.
# Your code here:


# --- Standard Deviation Calculation with NumPy ---
# TODO: Calculate the standard deviation of the temperatures array using numpy.
# Your code here:


# --- Range Calculation with NumPy ---
# TODO: Calculate the range (difference between max and min) using numpy.
# Your code here:


# --- IQR Calculation with NumPy ---
# TODO: Calculate the Interquartile Range (IQR) using numpy.percentile().
#       Calculate Q1 (25th percentile) and Q3 (75th percentile).
#       IQR = Q3 - Q1
# Your code here:


# --- Skewness Calculation with SciPy ---
# TODO: Calculate the skewness using scipy.stats.skew().
# Your code here:

# --- Kurtosis Calculation with SciPy ---
# TODO: Calculate the kurtosis using scipy.stats.kurtosis().
# Your code here:

In [None]:
# Simulated data for temperatures
temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21,19, 24, 0, 
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 
                22, 26, 23, 21, 23, 17, 20, 18]

# TODO: Calculate Q1 by calling the function with percentile=25 (use your previously defined function)
q1 = None
print("Q1:", q1)

# TODO: Calculate Q2 by calling the function with percentile=50
q2 = None 
print("Q2:", q2)

# TODO: Calculate Q3 by calling the function with percentile=75
q3 = None 
print("Q3:", q3)

# TODO: Calculate IQR by subtracting Q1 from Q3
iqr = None
print("IQR:", iqr)

**Define the lower and upper bounds for outliers**

In [None]:
# TODO: Calculate lower bound using formula: Q1 - 1.5 * IQR
lower_bound = None
print("Lower Bound:", lower_bound)

# TODO: Calculate upper bound using formula: Q3 + 1.5 * IQR
upper_bound = None
print("Upper Bound:", upper_bound)


**Filter the outliers**

In [None]:
# TODO: Create a list comprehension that only includes values between the bounds
filtered_temperatures = None

**Output the results**

In [None]:
print("Original Temperatures:", temperatures)
print("Filtered Temperatures:", filtered_temperatures)

---
## Q3 - Outlier Removal Function

Take the outlier filtering steps from Q2 and encapsulate them in a reusable function. The function should be easy to apply to different features, so that outlier removal and analysis stay consistent.

In [25]:
# Simulated data 
temperatures = [23, 25, 23, -5, 18, 19, 24, 21, 19, 24, 0, 
                24, 55, 50, 20, 25, 22, 26, 23, 17, 18]

humidity = [60, 65, 72, 68, 75, 80, 82, 78, 62, 68, 71, 
            69, 77, 81, 79, 64, 69, 67,  74, 68, 75, 100] 

In [13]:
# TODO: Define a function to retrieve only the data without outliers. Use the previous percentiles function


In [None]:
# TODO: Apply the defined function to retrieve only the data without outliers. 

filtered_temperatures = None
filtered_humidity = None

**Output the results**

In [None]:
print("Original Temperatures:", sorted(temperatures))
print("Filtered Temperatures:", filtered_temperatures)
print('='*110)
print("Original Humidity:", sorted(humidity))
print("Filtered Humidity:", filtered_humidity)