### About the dataset
This dataset represents a list of school districts in an anonymous country. The data includes district and state names, total population, and the literacy rate.

The dataset contains:

680 rows – each row is a different school district

| Column Name | Type   | Description                                                |
|-------------|--------|------------------------------------------------------------|
| DISTNAME    | str    | The names of an anonymous country’s school districts      |
| STATNAME    | str    | The names of an anonymous country’s states                 |
| BLOCKS      | int64  | The number of blocks in the school district. Blocks are the smallest organizational structure in the education system of the anonymous country.                |
| VILLAGES    | int64  | The number of villages in each district                    |
| CLUSTERS    | int64  | The number of clusters in the school district              |
| TOTPOPULAT  | float64  | The population for each district                           |
| OVERALL_LI  | float64  | The literacy rate for each district                        |

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/ManonYa09/Statistics_with_Python_G7/main/Dataset/0.%20education_districtwise.csv')

In [None]:
# Terminology: variable = column = feature (the three terms are used interchangeably)

### 1. Numerical Data
#### Numerical and Continuous
TOTPOPULAT, OVERALL_LI
#### Numerical and Discrete
VILLAGES

### 2. Categorical Data
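
Categorical variables take their values from a set of labels rather than from numbers. The column table above lists `DISTNAME` and `STATNAME` as `str`, so they are the natural candidates in this dataset. A minimal sketch on a small made-up frame (the rows are illustrative, not taken from the dataset) shows one way to separate such columns by dtype:

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's column names and types
toy = pd.DataFrame({
    "DISTNAME": ["DISTRICT A", "DISTRICT B", "DISTRICT C"],
    "STATNAME": ["STATE X", "STATE X", "STATE Y"],
    "VILLAGES": [120, 45, 300],
    "OVERALL_LI": [71.2, 65.0, 88.4],
})

# Non-numeric columns are treated as categorical in this sketch
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Selecting by dtype is only a first pass: a numeric code such as a postal code would still need to be marked as categorical by hand.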

### 3. Frequency Distribution

#### 3.1 Numerical Data as Continuous

In [27]:
df['TOTPOPULAT'].describe()

count    6.340000e+02
mean     1.899024e+06
std      1.547475e+06
min      7.948000e+03
25%      8.226940e+05
50%      1.564392e+06
75%      2.587520e+06
max      1.105413e+07
Name: TOTPOPULAT, dtype: float64

In [39]:
num_rows1, num_columns1 = df.shape

In [47]:
df.shape

(680, 7)

In [43]:
num_rows1

680

In [45]:
num_columns1

7

In [41]:
num_rows1 == num_columns1

False
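
The following cells use a DataFrame named `dataset` that is not defined in the cells shown here. Its shape of (634, 7) matches the non-null count of `TOTPOPULAT`, so one plausible reading, and it is only an assumption, is that `dataset` is `df` with the rows containing missing values dropped. A sketch of that idea on made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing population value (made-up numbers)
toy = pd.DataFrame({
    "DISTNAME": ["DISTRICT A", "DISTRICT B", "DISTRICT C"],
    "TOTPOPULAT": [800000.0, np.nan, 1200000.0],
    "OVERALL_LI": [66.0, 70.0, 75.0],
})

# Assumption: `dataset` was built like this, by removing incomplete rows
toy_dataset = toy.dropna()

print(toy.shape)          # rows before dropping
print(toy_dataset.shape)  # rows after dropping
```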

In [79]:
dataset.shape

(634, 7)

In [101]:
np.arange(0, 10.0001, 2)

array([ 0.,  2.,  4.,  6.,  8., 10.])

In [113]:
dataset['DISTNAME'].nunique()

634

In [105]:
# fig, ax = plt.subplots(figsize = (12, 6))
# sns.histplot(x = df['TOTPOPULAT'], bins= np.arange(0, 12*10**6, 10**6))
# ax.set_title('Distribution of Total Population', fontweight = 'bold')
# ax.set_xticks(np.arange(0, 12*10**6, 10**6))
# ax.set_xlim(0, 11*10**6)
# ax.bar_label(ax.containers[1], color = 'r')
# sns.despine()
# # plt.savefig('testing.jpg', dpi = 2000);

In [119]:
dataset['OVERALL_LI'].describe()

count    634.000000
mean      73.395189
std       10.098460
min       37.220000
25%       66.437500
50%       73.490000
75%       80.815000
max       98.760000
Name: OVERALL_LI, dtype: float64

In [None]:
sns.despine

#### 3.2 Discrete

In [246]:
df['VILLAGES'].describe()

count     680.000000
mean      874.614706
std       622.711657
min         6.000000
25%       390.750000
50%       785.500000
75%      1204.250000
max      3963.000000
Name: VILLAGES, dtype: float64

In [248]:
df['VILLAGES'].nunique()

576

In [262]:
# df['VILLAGES'].value_counts()

#### 3.3 Categorical

In [171]:
df1 = pd.read_csv('/Users/macbook/Desktop/Fundamental Data Science G8/Data Visualization/finall_3data_exericse_with_month.csv')

In [178]:
df1 = df1[['month', 'sum']]

In [185]:
df1.rename(columns={'sum':'Frequency'}, inplace=True)

In [209]:
df1.head(2)

  month  Frequency
0   Jan     860045
1   Feb    2071315


### 4. Measures of Central Tendency
#### 4.1 Mean or Average
- It is calculated by dividing the sum of all values by the count of all observations
- It can only be applied to numerical variables (not categorical)

`Note`: The main limitation of the mean is that it is sensitive to outliers (extreme values).

<img src="https://raw.githubusercontent.com/ManonYa09/Statistics_with_Python_G7/main/Part%201%20Introduction%20to%20Statistics/Distribution%20of%20Literacy%20Rate.jpg" width="100%" style="display: block; margin: 0 auto;">

In [None]:
0.25 * len(df)  # 25% of the 680 rows in df

In [309]:
df['OVERALL_LI'].describe()

count    634.000000
mean      73.395189
std       10.098460
min       37.220000
25%       66.437500
50%       73.490000
75%       80.815000
max       98.760000
Name: OVERALL_LI, dtype: float64

In [307]:
sum([100, 100, 1000000, 1000000, 300, 400, 500])/7

285914.28571428574

In [299]:
df['OVERALL_LI'].mean()

73.39518927444796

#### 4.2 Median

<img src="https://raw.githubusercontent.com/ManonYa09/Statistics_with_Python_G7/main/%20Slides/photos/2.%20Median.jpg" alt="Median" width="65%" style="display: block; margin: 0 auto;">

In [None]:
# df[(df['VILLAGES']>= df['VILLAGES'].median()) & (df['VILLAGES']<=df['VILLAGES'].max())].shape[0]
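
The median is the middle value of the sorted data, so half of the observations lie at or below it. Reusing the list from the mean example in 4.1, a quick sketch suggests how much less it reacts to extreme values than the mean does:

```python
import numpy as np

values = [100, 100, 1000000, 1000000, 300, 400, 500]

mean_value = sum(values) / len(values)
median_value = float(np.median(values))

print(mean_value)    # pulled upward by the two values of 1,000,000
print(median_value)  # middle of the sorted list: 100, 100, 300, 400, 500, 1000000, 1000000
```

Five of the seven values are at most 500, which the median of 400 reflects and the mean does not.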

#### 4.3 Mode 
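
The mode is the value that occurs most often. Unlike the mean and the median, it also applies to categorical variables. A minimal sketch with made-up labels (not taken from the dataset):

```python
import pandas as pd

# Hypothetical state labels; "STATE X" appears most often
states = pd.Series(["STATE X", "STATE Y", "STATE X", "STATE Z", "STATE X"])

# Series.mode() returns a Series because several values can tie for the top count
modes = states.mode()
top_value = modes.iloc[0]
counts = states.value_counts()

print(top_value)
print(counts[top_value])
```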

## 5. Variability

### 5.1 Measures of Spread

#### 5.1.1 Range 

The **range** is the difference between the largest and smallest value in a dataset. 
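
For `OVERALL_LI`, the `describe()` output shown earlier gives a maximum of 98.76 and a minimum of 37.22, so the range is 98.76 - 37.22 = 61.54. The same calculation as a small sketch on made-up values:

```python
import pandas as pd

# Hypothetical literacy rates for five districts
rates = pd.Series([37.2, 66.4, 73.5, 80.8, 98.8])

# Range = largest value minus smallest value
value_range = rates.max() - rates.min()

print(round(value_range, 1))
```

Because it uses only the two most extreme observations, the range is as sensitive to outliers as the mean is.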

#### 5.1.2 Standard Deviation
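
The standard deviation measures how far values typically lie from the mean, expressed in the same units as the data. The sketch below, on made-up numbers, computes the sample standard deviation by hand and compares it with `Series.std()`, which divides by n - 1 by default and is the `std` reported by `describe()`:

```python
import math
import pandas as pd

# Made-up sample
data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.sum() / len(data)
squared_deviations = (data - mean) ** 2

# Sample standard deviation: divide by n - 1, then take the square root
sample_std = math.sqrt(squared_deviations.sum() / (len(data) - 1))

print(sample_std)
print(data.std())  # pandas default (ddof=1) should agree with the manual value
```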

#### 5.1.3 Variance
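
The variance is the average squared deviation from the mean, and the standard deviation is its square root. A short sketch on made-up numbers contrasts the sample version (divide by n - 1) with the population version (divide by n):

```python
import pandas as pd

# Made-up sample
data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_variance = data.var()            # pandas default, ddof=1: divides by n - 1
population_variance = data.var(ddof=0)  # divides by n

print(sample_variance)
print(population_variance)
```

Since variance is in squared units, the standard deviation is usually the easier of the two to interpret.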

### 6. Measures of Position 

#### 6.1 Percentile

<img src="https://online.stat.psu.edu/public/stat800/lesson04/500%20l1%2025th%20and%2075th%20percentile.png" width="50%" style="display: block; margin: 0 auto;">
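
A percentile is the value below which a given percentage of the observations fall. As a rough sketch, `np.percentile` can compute it; the sample here is made up, and NumPy's default linear interpolation is assumed:

```python
import numpy as np

# Made-up sample: the integers 1 through 101
values = np.arange(1, 102)

p25 = np.percentile(values, 25)
p75 = np.percentile(values, 75)
p90 = np.percentile(values, 90)

print(p25, p75, p90)
```

Other interpolation methods can give slightly different results on small samples.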

#### 6.2 Quartile

Quartiles divide the sorted values of a dataset into four equal parts.

<img src="https://raw.githubusercontent.com/ManonYa09/Statistics_with_Python_G7/main/%20Slides/photos/3.%20Interqualtile%20range.jpg" width="60%" style="display: block; margin: 0 auto;">
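
The quartiles are the 25th, 50th, and 75th percentiles, and the interquartile range (IQR) in the figure above is the distance between the first and third. For `OVERALL_LI`, the `describe()` output shown earlier gives 80.815 - 66.4375 = 14.3775. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up sample of nine values
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First, second (median), and third quartiles
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

# Interquartile range: spread of the middle 50% of the data
iqr = q3 - q1

print(q1, q2, q3)
print(iqr)
```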

In [None]:
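# Bin edges every 1,000,000 up to 12,000,000 so that the maximum
# population (about 11.05 million) falls inside the last bin
bins = np.arange(0, 13*10**6, 10**6)

# Drop missing values explicitly before counting districts per bin
hist, edges = np.histogram(df['TOTPOPULAT'].dropna(), bins=bins)

hist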

In [None]:
edges