### About dataset
This dataset lists school districts in an anonymous country. For each district it records the district and state names, the number of blocks, villages, and clusters, the total population, and the literacy rate.

The dataset contains:

680 rows – each row is a different school district

| Column Name | Type   | Description                                                |
|-------------|--------|------------------------------------------------------------|
| DISTNAME    | str    | The names of an anonymous country’s school districts      |
| STATNAME    | str    | The names of an anonymous country’s states                 |
| BLOCKS      | int64  | The number of blocks in the school district. Blocks are the smallest organizational structure in the education system of the anonymous country.                |
| VILLAGES    | int64  | The number of villages in each district                    |
| CLUSTERS    | int64  | The number of clusters in the school district              |
| TOTPOPULAT  | float64  | The population for each district                           |
| OVERALL_LI  | float64  | The literacy rate for each district                        |

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('education_districtwise.csv')

#### Range
The **range** is the difference between the largest and smallest value in a dataset. 
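As a minimal sketch of this definition, the range can be computed directly from the maximum and minimum. The array `values` below is a made-up sample used only for illustration; it is not taken from the dataset.

```python
import numpy as np

# Hypothetical sample values, for illustration only
values = np.array([4, 8, 15, 16, 23, 42])

# Range = largest value minus smallest value
value_range = values.max() - values.min()
print('Range:', value_range)

# np.ptp ("peak to peak") computes the same quantity in one call
print('Range via np.ptp:', np.ptp(values))
```

Because the range depends only on the two most extreme values, a single unusual observation can change it a lot.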

#### Variance 
The **variance** is the average of the squared differences between each data point and the mean.
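The definition can be followed step by step in code. This is a sketch on a made-up array (`values` is an assumption, not part of the dataset), computing the population variance by hand and then comparing it with `np.var`, which divides by `n` by default.

```python
import numpy as np

# Hypothetical sample values, for illustration only
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Step 1: the mean
mean = values.mean()

# Step 2: squared difference of each point from the mean
squared_diffs = (values - mean) ** 2

# Step 3: average of the squared differences (population variance)
population_variance = squared_diffs.mean()
print('Variance by hand:', population_variance)

# np.var uses ddof=0 by default, i.e. it also divides by n
print('Variance via np.var:', np.var(values))
```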
#### Standard Deviation
The **standard deviation** measures how spread out the values are from the mean of the dataset. It is the square root of the variance.

- A low standard deviation indicates that the data are clustered tightly around the mean; a high standard deviation indicates that the data are more spread out.

<img src="https://raw.githubusercontent.com/ManonYa09/Statistics_with_Python_G8/main/Slides/photos/photo_2023-11-11_10-46-48.jpg" alt="Control Structure" width="75%" style="display: block; margin: 0 auto;">

**Exercise:** 
```py
data = [56, 65, 74, 75, 76, 77, 80, 81, 91]
```
- Find the mean.
- Find the population standard deviation.

In [22]:
data = np.array([56, 65, 74, 75, 76, 77, 80, 81, 91])
print('The average of our data:', data.mean())
print('The STD of our data:', np.round(data.std(), 2))

The average of our data: 75.0
The STD of our data: 9.33
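The result above is the population standard deviation, because NumPy's `std` divides by `n` unless told otherwise. As a sketch of the difference, the same array can be passed through `std(ddof=1)`, which divides by `n - 1` and gives the sample standard deviation.

```python
import numpy as np

data = np.array([56, 65, 74, 75, 76, 77, 80, 81, 91])

# Population standard deviation: divide by n (NumPy default, ddof=0)
population_std = data.std()

# Sample standard deviation: divide by n - 1 (ddof=1)
sample_std = data.std(ddof=1)

print('Population STD:', np.round(population_std, 2))
print('Sample STD:', np.round(sample_std, 2))
```

Note that pandas uses the opposite default: `Series.std()` and `DataFrame.std()` use `ddof=1`, so the same numbers can give slightly different results depending on which library computes them.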


### 3. Measures of Position 



**Measures of Position**  let you determine the position of a value in relation to other values in a dataset.

#### 3.1 Percentile 

#### What Is a Percentile in Statistics?

Percentiles are values that divide a dataset into **100** equal parts. They are written $P_1, P_2, P_3, \dots, P_{99}$; for example, $P_{90}$ is the value below which roughly 90% of the data fall.

![image.png](attachment:image.png)
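A short sketch of how percentiles can be computed with NumPy, reusing the small sample from the exercise. `np.percentile` expects percentiles on a 0 to 100 scale and, by default, interpolates linearly between neighbouring data points, so other tools may report slightly different values.

```python
import numpy as np

# Same sample as in the exercise; any 1-D numeric array would work
data = np.array([56, 65, 74, 75, 76, 77, 80, 81, 91])

# Ask for several percentiles at once (0-100 scale)
p25, p50, p90 = np.percentile(data, [25, 50, 90])

print('P25:', p25)
print('P50 (median):', p50)
print('P90:', p90)
```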

### Quartile

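Quartiles divide the values in a dataset into four equal parts: $Q_1$ is the 25th percentile, $Q_2$ is the median (50th percentile), and $Q_3$ is the 75th percentile.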


<img src="https://online.stat.psu.edu/public/stat800/lesson04/500%20l1%2025th%20and%2075th%20percentile.png" alt="Control Structure" width="75%" style="display: block; margin: 0 auto;">
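As a sketch, the quartiles and the interquartile range (IQR, the distance between the first and third quartiles) can be computed with pandas. The Series `values` below reuses the exercise sample and is only illustrative. `Series.quantile` takes fractions between 0 and 1 rather than percentages.

```python
import pandas as pd

# Same sample as in the exercise, wrapped in a pandas Series
values = pd.Series([56, 65, 74, 75, 76, 77, 80, 81, 91])

# First quartile, median, and third quartile in one call
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# Interquartile range: spread of the middle 50% of the data
iqr = q3 - q1

print('Q1:', q1)
print('Q2 (median):', q2)
print('Q3:', q3)
print('IQR:', iqr)
```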

#### Boxplot

#### Using visualizations


<img src="https://raw.githubusercontent.com/ManonYa09/Statistics_with_Python_G8/main/Slides/photos/Box-Plot-and-Whisker-Plot-1.png" alt="Control Structure" width="75%" style="display: block; margin: 0 auto;">
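A minimal sketch of drawing a boxplot with Matplotlib, using the exercise sample as stand-in data. By default `boxplot` extends the whiskers to the most extreme points within 1.5 × IQR of the box and draws anything beyond them as individual points; the title and axis label are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Same sample as in the exercise
data = np.array([56, 65, 74, 75, 76, 77, 80, 81, 91])

fig, ax = plt.subplots()

# The box spans Q1 to Q3, the line inside it is the median,
# and the whiskers use the default 1.5 * IQR rule
box = ax.boxplot(data)
ax.set_title('Boxplot of the exercise data')
ax.set_ylabel('Value')

# Points drawn beyond the whiskers ("fliers") are potential outliers
flier_values = box['fliers'][0].get_ydata()
print('Points beyond the whiskers:', sorted(flier_values))

plt.show()
```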

[Details](https://www.scribbr.com/statistics/outliers/)

### Outlier
Outliers are values at the extreme ends of a dataset.

Some outliers represent true values from natural variation in the population. Other outliers may result from incorrect data entry, equipment malfunctions, or other measurement errors.
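One common rule of thumb, and the one applied by the `outlier_detection` function in this notebook, flags a value $x$ as a potential outlier when it lies outside fences built from the quartiles:

$$
\text{IQR} = Q_3 - Q_1, \qquad x < Q_1 - 1.5 \times \text{IQR} \quad \text{or} \quad x > Q_3 + 1.5 \times \text{IQR}
$$

The factor 1.5 is a convention rather than a strict rule, so flagged values are best treated as candidates for a closer look, not as errors.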



In [27]:
def outlier_detection(df):
    """
    Detect potential outliers in a numeric column using the Interquartile Range (IQR) method.

    Parameters:
    - df (pandas.Series): A column of numerical data, for example df['Column1'].

    Returns:
    - pandas.Series: The values that are potential outliers, with their original index.

    The function calculates the first quartile (Q1), third quartile (Q3), and the Interquartile Range (IQR).
    It then identifies potential outliers below the lower bound (Q1 - 1.5 * IQR) or above the upper bound (Q3 + 1.5 * IQR).

    Example:
    >>> data = {'Column1': [2, 4, 5, 7, 8, 9, 10, 11, 12, 50]}
    >>> df = pd.DataFrame(data)
    >>> outlier_detection(df['Column1'])
    9    50
    Name: Column1, dtype: int64
    """
    # Quartiles and interquartile range
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    # Fences: values beyond these bounds are flagged
    upper_end = Q3 + 1.5 * IQR
    lower_end = Q1 - 1.5 * IQR
    # Keep only the values outside the fences
    outliers = df[(df > upper_end) | (df < lower_end)]
    return outliers
