# Assignment 1:  Unlocking Statistical Insights with Python
---

Welcome to Assignment 1! In this assignment, you'll dive into the world of statistical analysis using Python.  We'll move beyond theoretical concepts and focus on practical application. You'll learn how to calculate key statistical measures and use them for real-world tasks like outlier detection.

This assignment will guide you through:

* **Revising the concepts and equations:**  You'll implement essential statistics from scratch (mean, median, mode, variance, standard deviation, range, and interquartile range (IQR)) to reinforce the underlying formulas.
* **Calculating Basic Statistics with libraries:**  You'll then recompute the same statistics with NumPy and SciPy.
* **Outlier Detection:** You'll apply the Interquartile Range (IQR) method to identify and filter outliers from datasets, a crucial step in data cleaning and analysis.

By the end of this assignment, you'll have a strong foundation in performing statistical calculations in Python and applying these techniques to understand and refine data.

#### General Instructions 

Please adhere to the following guidelines:

- **Code Clarity:** Your code should be well-formatted, easy to understand, and include meaningful variable names.
- **Docstrings:**  Use docstrings to document your functions and explain their purpose, arguments, and return values.
- **Testing:**  Use the given example data to demonstrate that your code works.
- **NOTE:** Write your answers in this notebook, alongside the given examples.


---

## Q1 - Basic Statistics Calculator

**Objective:** Create a Python program that calculates basic statistics (mean, median, mode, range, variance, and standard deviation) for a given set of numbers.

**Requirements:**
Follow the TODOs below to complete each statistical calculation.


In [1]:
import numpy as np

In [2]:
# Sample temperature readings used throughout Q1
temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0, 
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 
                22, 26, 23, 21, 23, 17, 20]



### Mean

The mean = The sum of the elements / The count of the elements

In [3]:
# --- Mean Calculation ---
# TODO: Calculate the mean of the numbers in the 'temperatures' list.
#       Handle the case where the list is empty (return None or print a message).
#       Print the mean using an f-string.
# Your code here:
def calc_mean(list1):
    x = 0
    if len(list1) != 0:
        for i in list1:
            x = x + i
        return f"The mean of the input is {x/len(list1)}"
        
    else:
        return "The list is empty."

In [4]:
print(calc_mean(temperatures),"\n")
empty_list = []
print(calc_mean(empty_list))

The mean of the input is 22.0 

The list is empty.


### Median

If the number of observations is odd, the number in the middle of the list is the median. This can be found by taking the value of the (n+1)/2 -th term, where n is the number of observations.

Else, if the number of observations is even, then the median is the simple average of the middle two numbers. In calculation, the median is the simple average of the n/2 -th and the (n/2 + 1) -th terms.

In [5]:
temperatures_even = sorted([23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0, 
                    20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 
                    22, 26, 23, 21, 23, 17, 20])
temperatures_odd = sorted([23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0, 
                    20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 
                    22, 26, 23, 21, 23, 17, 20 , 15])

In [6]:
first = int(len(temperatures_even) / 2)
second = int(len(temperatures_even) / 2) - 1
print("The two values in the middle: ",temperatures_even[first], temperatures_even[second])


The two values in the middle:  22 21


In [7]:
# --- Median Calculation ---
# TODO: Calculate the median of the numbers in the 'temperatures' list.
#       Sort the 'temperatures' list first.
#       Handle both even and odd length lists.
#       Print the median using an f-string.
# Your code here:


def calc_median(list1):
    """Return the median of an already sorted list, formatted as a string."""
    n = len(list1)
    lower = n // 2 - 1  # index of the lower middle value (indices start at zero)
    upper = n // 2      # upper middle value, or the single middle value when n is odd

    if n % 2 == 0:  # even number of observations: average the two middle values
        med = (list1[lower] + list1[upper]) / 2
    else:  # odd number of observations: take the single middle value
        med = list1[upper]
    return f"{med}"




In [8]:
print("The median with an even list:",calc_median(temperatures_even))
print("The median with an odd list:",calc_median(temperatures_odd))

The median with an even list: 21.5
The median with an odd list: 21


### Mode

In [9]:
# --- Mode Calculation (Simple) ---
# TODO: Calculate the mode (most frequent number) in the 'temperatures' list.
#       If there are multiple modes, you can return any one of them.
#       If there is no mode (all numbers appear once), you can return None or print a message.
#       Print the mode using an f-string.
# Your code here:


def calc_mode(list1):
    occ_dict = dict()
    for i in list1:
        if i not in occ_dict:
            occ_dict[i] = 1
        else:
            occ_dict[i] += 1
    

    # Sort based on reverse of Values
    val_based_rev = {k: v for k, v in sorted(occ_dict.items(), key=lambda item: item[1], reverse=True)}
    return f"The mode is: {list(val_based_rev.keys())[0]}"

calc_mode(temperatures)

'The mode is: 20'
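
The TODO above allows returning any one mode and asks for special handling when no value repeats. As a minimal sketch of that edge-case handling, `collections.Counter` can return every tied mode and an empty list when all values are unique. The helper name `all_modes` and the variable `temps_q1` (a copy of the Q1 data) are illustrative assumptions, not part of the assignment.

```python
from collections import Counter


def all_modes(values):
    """Return every most-frequent value, or [] when no value repeats (hypothetical helper)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:  # every value appears exactly once, so there is no mode
        return []
    # Counter preserves first-seen order, so ties come back in input order
    return [value for value, count in counts.items() if count == highest]


# Copy of the Q1 data, kept under a separate name so the notebook variable is untouched
temps_q1 = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
            20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
            22, 26, 23, 21, 23, 17, 20]

print(all_modes(temps_q1))
print(all_modes([1, 1, 2, 2, 3]))
print(all_modes([1, 2, 3]))
```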

### Variance

In [10]:
# --- Variance Calculation ---
# TODO: Calculate the variance of the numbers in the 'temperatures' list.
#       Use the formula for sample variance (divide by n-1 for sample, n for population).
#       For this exercise, assume it's sample variance if list has more than 1 element, else return 0 if list has 0 or 1 element.
#       Print the variance using an f-string.
# Your code here:
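# Redefine calc_mean so it returns a number instead of a formatted string
def calc_mean(list1):
    """Return the arithmetic mean, or None for an empty list."""
    if len(list1) == 0:
        return None
    x = 0
    for i in list1:
        x = x + i
    return x / len(list1)

def calc_variance(list1):
    """Return the sample variance (n - 1 denominator); 0 for fewer than two values."""
    n = len(list1)
    if n < 2:
        return 0
    mean = calc_mean(list1)  # compute the mean once, not on every iteration
    x = 0
    for i in list1:
        x = x + ((i - mean) ** 2) / (n - 1)
    return x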

calc_variance(temperatures)


104.12903225806453
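
As a quick sanity check on the hand-written formula, the standard library's `statistics` module provides both variants: `statistics.variance` divides by n - 1 (sample) and `statistics.pvariance` divides by n (population). A minimal sketch, where `temps_q1` is an illustrative copy of the Q1 data:

```python
import statistics

# Copy of the Q1 data, kept under a separate name so the notebook variable is untouched
temps_q1 = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
            20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
            22, 26, 23, 21, 23, 17, 20]

sample_var = statistics.variance(temps_q1)       # divides by n - 1
population_var = statistics.pvariance(temps_q1)  # divides by n

print(f"Sample variance:     {sample_var:.4f}")
print(f"Population variance: {population_var:.4f}")
```

The sample variance is always the larger of the two, and the gap shrinks as the list grows.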

### Standard Deviation

In [11]:
# --- Standard Deviation Calculation ---
# TODO: Calculate the standard deviation of the numbers in the 'temperatures' list.
#       Take the square root of the variance calculated above.
#       Print the standard deviation using an f-string.
# Your code here:

def calc_std(list1):
    return calc_variance(list1) ** 0.5 #Square root

calc_std(temperatures)

10.204363393081634

### Range

In [12]:
# --- Range Calculation ---
# TODO: Calculate the range (difference between max and min) of the 'temperatures' list.
#       Print the range using an f-string.
# Your code here:

def calc_range(list1):
    sorted_list = sorted(list1)
    return sorted_list[-1] - sorted_list[0]

calc_range(temperatures)

60

### IQR

In [13]:
# --- IQR Calculation ---
# TODO: Calculate the Interquartile Range (IQR) of the 'temperatures' list.
#       Use the percentile function you defined in Q1 or define a new one if needed.
#       Calculate Q1 (25th percentile) and Q3 (75th percentile).
#       IQR = Q3 - Q1
#       Print the IQR using an f-string.
# Your code here:
def calc_IQR(list1):
    """Return Q3 - Q1 of an already sorted list (median-of-halves method), as a string."""
    n = len(list1)

    # Q1 (First Quartile): The median of the first half of the dataset
    # (excluding the median if the number of data points is odd). For [2,4,4], Q1 = 4
    first_half = list1[: n // 2]

    # Q3 (Third Quartile): The median of the second half of the dataset. For [6,8,9], Q3 = 8
    # (n + 1) // 2 skips the middle element only when n is odd
    second_half = list1[(n + 1) // 2 :]

    Q1 = float(calc_median(first_half))  # calc_median returns a formatted string
    Q3 = float(calc_median(second_half))

    IQR = Q3 - Q1
    return f'{IQR}'

calc_IQR(temperatures_even)

'3.5'
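
Quartiles have several competing definitions, so a median-of-halves IQR need not match the value NumPy reports. The sketch below compares the two on the Q1 data. `iqr_median_of_halves` is a hypothetical helper written only for this comparison, and `np.percentile` is called with its default linear interpolation.

```python
import statistics

import numpy as np


def iqr_median_of_halves(values):
    """IQR where Q1 and Q3 are the medians of the lower and upper halves (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]
    upper_half = data[(n + 1) // 2 :]  # skips the middle value only when n is odd
    return statistics.median(upper_half) - statistics.median(lower_half)


# Copy of the Q1 data, kept under a separate name so the notebook variable is untouched
temps_q1 = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
            20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
            22, 26, 23, 21, 23, 17, 20]

q1, q3 = np.percentile(temps_q1, [25, 75])  # linear interpolation between neighbours
halves_iqr = iqr_median_of_halves(temps_q1)
numpy_iqr = q3 - q1

print("Median of halves:", halves_iqr)
print("NumPy (linear interpolation):", numpy_iqr)
```

On this 32-value list the two conventions give 3.5 and 3.25, so it is worth stating which definition you used when reporting an IQR.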

### Skewness

In [14]:
# --- Skewness Calculation ---
# TODO: (Optional) Calculate the skewness of the 'temperatures' list.
#       You can use a simple method for skewness calculation.
#       Print the skewness using an f-string.
# Your code here:

def calc_skewness(list1):
    # Heuristic only: compares the mean with the median instead of computing a skewness coefficient
    mean = calc_mean(list1)
    median = float(calc_median(sorted(list1)))  # the median needs sorted input
    if mean > median:
        return 'Positive skewness: Data is skewed right (long tail on the right)'
    elif mean < median:
        return 'Negative skewness: Data is skewed left (long tail on the left)'
    else:
        return 'Zero skewness: Data is symmetrical'
    
calc_skewness(temperatures)

'Positive skewness: Data is skewed right (long tail on the right)'
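
Comparing the mean with the median only indicates the direction of the skew. For a numeric value, one common definition is the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. The same moments also give excess kurtosis, m4 / m2^2 - 3, which is the topic of the next section. The sketch below uses plain Python and the biased (population) form of each moment. The helper names `central_moment`, `moment_skewness`, and `excess_kurtosis` are hypothetical, and the input is assumed to be non-empty.

```python
def central_moment(values, order):
    """Return the central moment of the given order, dividing by n (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def moment_skewness(values):
    """g1 = m3 / m2**1.5; positive means a longer right tail."""
    m2 = central_moment(values, 2)
    if m2 == 0:  # all values identical: no spread, so no skew
        return 0.0
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """m4 / m2**2 - 3; zero for a normal distribution."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        return 0.0
    return central_moment(values, 4) / m2 ** 2 - 3


# Copy of the Q1 data, kept under a separate name so the notebook variable is untouched
temps_q1 = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
            20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
            22, 26, 23, 21, 23, 17, 20]

print(f"Skewness: {moment_skewness(temps_q1):.2f}")
print(f"Excess kurtosis: {excess_kurtosis(temps_q1):.2f}")
```

A positive g1 here agrees with the mean (22.0) sitting above the median (21.5).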

### Kurtosis

In [15]:
# --- Kurtosis Calculation ---
# TODO: Calculate the kurtosis of the 'temperatures' list.
#       You can use a simple method for kurtosis calculation.
#       Print the kurtosis using an f-string.
# Your code here:
from scipy.stats import kurtosis

def calc_kurtosis(list1):
    k = kurtosis(list1)  # excess (Fisher) kurtosis by default: 0 for a normal distribution
    if k > 0:
        return f"Kurtosis is equal to : {round(k, 2)}, which means a sharp peak with heavy tails"
    elif k < 0:
        return f"Kurtosis is equal to : {round(k, 2)}, which means a flat peak with light tails"
    else:
        return f"Kurtosis is equal to : {round(k, 2)}, which means that it's normal"

calc_kurtosis(temperatures)

'Kurtosis is equal to : 4.89, which means a sharp peak with heavy tails'

---
## Q2 - Repeat Q1 with NumPy
**Objective:** Use NumPy to calculate basic statistics (mean, median, mode, range, variance, and standard deviation) for a set of given numbers.

---
## Q2. Filtering Outliers


The basic approach is to identify potential outliers based on a defined threshold. A common method is to use the **Interquartile Range (IQR)**. Here's how it works:

1. Use Python libraries (don't hard-code the equations) to calculate basic statistics (mean, median, mode, range, variance, standard deviation, and IQR) for a set of given numbers.

2. **Define Outlier Boundaries:**
   - Lower Boundary: Q1 - 1.5 * IQR
   - Upper Boundary: Q3 + 1.5 * IQR

3. **Filter the List:**
   - Remove any values that fall outside the calculated boundaries.

In [16]:
import numpy as np
from scipy import stats # Import scipy.stats for mode and skewness if needed

# Given set of numbers
temperatures = np.array([25, 32, 45, 18, 60, 55, 48, 72, 23, 25, 20, 23, -5, 21, 18, 19, 24, 21,19, 24, 0,
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
                22, 26, 23, 21, 23, 17, 20, 18])

In [17]:
# --- Mean Calculation with NumPy ---
# TODO: Calculate the mean of the temperatures array using numpy.
# Your code here:
arr = np.array(temperatures)
np.mean(arr)

# --- Median Calculation with NumPy ---
# TODO: Calculate the median of the temperatures array using numpy.
# Your code here:
arr1 = np.array(temperatures_even) 
arr2 = np.array(temperatures_odd)

np.median(arr1)
np.median(arr2)

# --- Mode Calculation with NumPy/SciPy ---
# TODO: Calculate the mode (most frequent number) using numpy or scipy.stats.
#       NumPy doesn't have a direct mode function. 
# Your code here:
from scipy import stats as st
st.mode(arr)

# --- Variance Calculation with NumPy ---
# TODO: Calculate the variance of the temperatures array using numpy.
# Your code here:
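np.var(temperatures, ddof=1)  # ddof=1 gives the sample variance used in Q1; the default (ddof=0) is the population variance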

# --- Standard Deviation Calculation with NumPy ---
# TODO: Calculate the standard deviation of the temperatures array using numpy.
# Your code here:
np.std(temperatures, ddof=1)  # ddof=1 gives the sample standard deviation; the default (ddof=0) is the population value

# --- Range Calculation with NumPy ---
# TODO: Calculate the range (difference between max and min) using numpy.
# Your code here:
np.max(temperatures) - np.min(temperatures)

# --- IQR Calculation with NumPy ---
# TODO: Calculate the Interquartile Range (IQR) using numpy.percentile().
#       Calculate Q1 (25th percentile) and Q3 (75th percentile).
#       IQR = Q3 - Q1
# Your code here:
np.quantile(temperatures, 0.75) - np.quantile(temperatures, 0.25)

# --- Skewness Calculation with SciPy ---
# TODO: Calculate the skewness using scipy.stats.skew().
# Your code here:
st.skew(temperatures)

# --- Kurtosis Calculation with SciPy ---
# TODO: Calculate the kurtosis using scipy.stats.kurtosis().
# Your code here:
st.kurtosis(temperatures)


np.float64(1.7896077524028078)

In [18]:
# Simulated data for temperatures
temperatures = [23, 25, 20, 23, -5, 21, 18, 19, 24, 21,19, 24, 0, 
                20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 
                22, 26, 23, 21, 23, 17, 20, 18]

# TODO: Calculate Q1 by calling the function with percentile=25 (use your previously defined function)
q1 = np.quantile(temperatures,0.25)
print("Q1:", q1)

# TODO: Calculate Q2 by calling the function with percentile=50
q2 = np.quantile(temperatures,0.5) 
print("Q2:", q2)

# TODO: Calculate Q3 by calling the function with percentile=75
q3 = np.quantile(temperatures,0.75) 
 
print("Q3:", q3)

# TODO: Calculate IQR by subtracting Q1 from Q3
iqr = q3 - q1
print("IQR:", iqr)

Q1: 20.0
Q2: 21.0
Q3: 23.0
IQR: 3.0


**Define the lower and upper bounds for outliers**

In [19]:
# TODO: Calculate lower bound using formula: Q1 - 1.5 * IQR
lower_bound = q1 - 1.5 * iqr
print("Lower Bound:", lower_bound)

# TODO: Calculate upper bound using formula: Q3 + 1.5 * IQR
upper_bound = q3 + 1.5 * iqr
print("Upper Bound:", upper_bound)


Lower Bound: 15.5
Upper Bound: 27.5


**Filter the outliers**

In [20]:
# TODO: Create a list comprehension that only includes values between the bounds
filtered_temperatures = [i for i in temperatures if lower_bound <= i <= upper_bound]

**Output the results**

In [21]:
print("Original Temperatures:", temperatures)
print("Filtered Temperatures:", filtered_temperatures)

Original Temperatures: [23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0, 20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19, 22, 26, 23, 21, 23, 17, 20, 18]
Filtered Temperatures: [23, 25, 20, 23, 21, 18, 19, 24, 21, 19, 24, 20, 24, 22, 22, 20, 21, 22, 20, 25, 19, 22, 26, 23, 21, 23, 17, 20, 18]
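
With a NumPy array, the same filter can be written as a boolean mask instead of a list comprehension, which also makes it easy to inspect the removed values. A minimal sketch on the Q2 data; the names `temps`, `mask`, `kept`, and `outliers` are illustrative:

```python
import numpy as np

# Copy of the Q2 data, kept under a separate name so the notebook variable is untouched
temps = np.array([23, 25, 20, 23, -5, 21, 18, 19, 24, 21, 19, 24, 0,
                  20, 24, 55, 22, 50, 22, 20, 21, 22, 20, 25, 19,
                  22, 26, 23, 21, 23, 17, 20, 18])

q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Element-wise comparison produces one True/False per value
mask = (temps >= lower_bound) & (temps <= upper_bound)
kept = temps[mask]        # values inside the boundaries
outliers = temps[~mask]   # values outside the boundaries

print("Kept:", kept.tolist())
print("Outliers:", outliers.tolist())
```

On arrays, `&` is the element-wise operator; on plain Python booleans, `and` or a chained comparison is the idiomatic choice.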


---
## Q3: Outlier Removal Function

Take the outlier filtering steps from the previous explanation (Q2) and encapsulate them into a reusable function. This function should be designed to be easily applied to different features for consistent outlier removal and analysis.

In [22]:
# Simulated data 
temperatures = [23, 25, 23, -5, 18, 19, 24, 21, 19, 24, 0, 
                24, 55, 50, 20, 25, 22, 26, 23, 17, 18]

humidity = [60, 65, 72, 68, 75, 80, 82, 78, 62, 68, 71, 
            69, 77, 81, 79, 64, 69, 67,  74, 68, 75, 100] 

In [23]:
# TODO: Define a function to retrieve only data without outliers. Use the previous percentiles function
def clip_outliers(list1):
    """Return the values of list1 that lie within the IQR outlier boundaries."""
    q1 = np.quantile(list1, 0.25)
    q3 = np.quantile(list1, 0.75)

    iqr = q3 - q1

    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Keep values on or inside the boundaries; everything else is an outlier
    return [i for i in list1 if lower_bound <= i <= upper_bound]

In [24]:
# TODO: Apply the defined function to retrieve only data without outliers.

filtered_temperatures = clip_outliers(temperatures)
filtered_humidity = clip_outliers(humidity)

**Output the results**

In [25]:
print("Original Temperatures:", sorted(temperatures))
print("Filtered Temperatures:", filtered_temperatures)
print('='*110)
print("Original Humidity:", sorted(humidity))
print("Filtered Humidity:", filtered_humidity)

Original Temperatures: [-5, 0, 17, 18, 18, 19, 19, 20, 21, 22, 23, 23, 23, 24, 24, 24, 25, 25, 26, 50, 55]
Filtered Temperatures: [23, 25, 23, 18, 19, 24, 21, 19, 24, 24, 20, 25, 22, 26, 23, 17, 18]
Original Humidity: [60, 62, 64, 65, 67, 68, 68, 68, 69, 69, 71, 72, 74, 75, 75, 77, 78, 79, 80, 81, 82, 100]
Filtered Humidity: [60, 65, 72, 68, 75, 80, 82, 78, 62, 68, 71, 69, 77, 81, 79, 64, 69, 67, 74, 68, 75]
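
To apply the same rule consistently across several features, one option is to keep the features in a dictionary and loop over it. The sketch below is self-contained, so it defines its own hypothetical `iqr_filter` helper that returns the kept values and the removed outliers separately. The helper name, the `k` parameter, and the return shape are illustrative choices, not part of the assignment.

```python
import numpy as np


def iqr_filter(values, k=1.5):
    """Split values into (kept, removed) using the Q1 - k*IQR and Q3 + k*IQR boundaries."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if v < lower or v > upper]
    return kept, removed


# Copies of the Q3 data, grouped by feature name
features = {
    "temperatures": [23, 25, 23, -5, 18, 19, 24, 21, 19, 24, 0,
                     24, 55, 50, 20, 25, 22, 26, 23, 17, 18],
    "humidity": [60, 65, 72, 68, 75, 80, 82, 78, 62, 68, 71,
                 69, 77, 81, 79, 64, 69, 67, 74, 68, 75, 100],
}

results = {name: iqr_filter(values) for name, values in features.items()}
for name, (kept, removed) in results.items():
    print(f"{name}: kept {len(kept)} values, removed {removed}")
```

Returning the removed values alongside the kept ones makes it easier to review what the rule discarded before trusting the cleaned data.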
