## Overview

Whenever a data professional works with a new dataset, the first step is to understand the context of the data during the discovery stage. This often involves discussing the data with project stakeholders and reading documentation about the dataset and how it was collected. Next comes data cleaning, which addresses issues such as missing data, incorrect values, and irrelevant data. Computing descriptive statistics is a common step after data cleaning.

In [1]:
import pandas as pd 
import numpy as np 
import datetime as dt 
import matplotlib.pyplot as plt

### Case Study
I'm working for the government of a large nation. The Department of Education wants to understand current literacy rates across the country.

##### Literacy Rate: 
The percentage of the population of a given age group that can read and write. 
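
As a minimal sketch of that definition, the percentage can be computed from two counts. The function name `literacy_rate` and the sample numbers are my own assumptions for illustration; the dataset already stores the finished rate in OVERALL_LI.

```python
def literacy_rate(literate: int, population: int) -> float:
    """Return the share of `population` that can read and write, as a percentage."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * literate / population


# Hypothetical counts, invented for illustration only.
rate = literacy_rate(literate=669, population=1000)
print(round(rate, 2))
```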

### My Task:
Analyze data about the literacy rate among primary and secondary students, who range in age from 6 to 18 years.
I'll use descriptive statistics to get a basic understanding of the literacy rate data for each district.

In [3]:
edu_raw = pd.read_csv("./Raw_data/education_districtwise.csv")

edu_raw.head()

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
3,DISTRICT259,STATE1,13,523,104,735753.0,57.98
4,DISTRICT486,STATE1,8,359,64,570060.0,65.0


In [4]:
edu_raw.shape

(680, 7)

In [5]:
edu_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 680 entries, 0 to 679
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   DISTNAME    680 non-null    object 
 1   STATNAME    680 non-null    object 
 2   BLOCKS      680 non-null    int64  
 3   VILLAGES    680 non-null    int64  
 4   CLUSTERS    680 non-null    int64  
 5   TOTPOPULAT  634 non-null    float64
 6   OVERALL_LI  634 non-null    float64
dtypes: float64(2), int64(3), object(2)
memory usage: 37.3+ KB


In [7]:
edu_raw.describe().round(2)

Unnamed: 0,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
count,680.0,680.0,680.0,634.0,634.0
mean,10.76,874.61,121.23,1899024.13,73.4
std,9.59,622.71,94.04,1547475.45,10.1
min,1.0,6.0,1.0,7948.0,37.22
25%,5.0,390.75,56.75,822694.0,66.44
50%,8.0,785.5,101.0,1564392.5,73.49
75%,13.0,1204.25,162.5,2587519.75,80.82
max,66.0,3963.0,592.0,11054131.0,98.76


In [9]:
edu_raw["OVERALL_LI"].describe().round(2)

count    634.00
mean      73.40
std       10.10
min       37.22
25%       66.44
50%       73.49
75%       80.82
max       98.76
Name: OVERALL_LI, dtype: float64

The dataset has 680 rows, but describe() reports a count of only 634 for the "OVERALL_LI" column. That column is therefore missing 680 - 634 = 46 values.
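
The same check can be done directly instead of comparing counts by eye. Here is a small sketch on a made-up frame, not the real file. The name `toy` and its values are assumptions for illustration, while `isna()`, `sum()`, and `count()` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the district table; values are invented.
toy = pd.DataFrame({
    "DISTNAME": ["D1", "D2", "D3", "D4"],
    "TOTPOPULAT": [1000.0, np.nan, 2500.0, np.nan],
    "OVERALL_LI": [66.9, np.nan, 71.2, np.nan],
})

# Number of missing values in every column.
missing_per_column = toy.isna().sum()

# Same figure for one column: total rows minus the non-null count.
missing_li = len(toy) - toy["OVERALL_LI"].count()

print(missing_per_column)
print(missing_li)
```

Run against edu_raw, the same two expressions should agree with the difference between the row count and the describe() count.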

In [10]:
edu_overall_li_nulls = edu_raw[ edu_raw["OVERALL_LI"].isnull()]
edu_overall_li_nulls

Unnamed: 0,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
54,DISTRICT302,STATE26,5,510,61,,
55,DISTRICT276,STATE26,6,393,59,,
200,DISTRICT588,STATE21,10,951,71,,
205,DISTRICT535,STATE21,13,1050,99,,
206,DISTRICT218,STATE21,8,341,47,,
207,DISTRICT258,STATE21,6,342,53,,
266,DISTRICT303,STATE3,4,62,5,,
267,DISTRICT608,STATE3,3,160,13,,
268,DISTRICT62,STATE3,4,145,7,,
269,DISTRICT474,STATE3,4,91,11,,


In [11]:
edu_overall_li_nulls.shape

(46, 7)

#### The describe() function for a categorical column

* Number of unique values
* Mode: most common value
* The frequency of the mode

In [14]:
edu_raw[["STATNAME","DISTNAME"]].describe()

Unnamed: 0,STATNAME,DISTNAME
count,680,680
unique,36,680
top,STATE21,DISTRICT341
freq,75,1
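
To see where the unique, top, and freq figures come from, here is a sketch that rebuilds them with `nunique()` and `value_counts()` on a small invented Series. The `states` data is hypothetical and is not taken from the file.

```python
import pandas as pd

# Hypothetical state labels, invented for illustration.
states = pd.Series(
    ["STATE1", "STATE2", "STATE2", "STATE3", "STATE2"], name="STATNAME"
)

# Frequency of each distinct value, sorted from most to least common.
counts = states.value_counts()

n_unique = states.nunique()   # number of distinct values
top = counts.index[0]         # the mode (most common value)
freq = int(counts.iloc[0])    # how often the mode occurs

# describe() reports the same three figures, plus the non-null count.
summary = states.describe()
print(n_unique, top, freq)
print(summary)
```

When several values tie for most common, describe() reports only one of them as top, so `value_counts()` is the safer view if ties matter.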


In [18]:
# Compute max and min once; keeping the column lookup out of the f-strings
# avoids nested double quotes, which are a syntax error before Python 3.12.
li_max = edu_raw["OVERALL_LI"].max()
li_min = edu_raw["OVERALL_LI"].min()
range_overall_li = li_max - li_min
print(f"The Max level of Literacy is: {li_max}")
print(f"The Minimum level of literacy is: {li_min}")
print(f"The range of literacy across the country is: {round(range_overall_li, 2)}")

The Max level of Literacy is: 98.76
The Minimum level of literacy is: 37.22
The range of literacy across the country is: 61.54
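
One detail worth noting about the range: pandas `max()` and `min()` skip missing values by default, so the districts without a literacy rate do not affect the result. A small sketch on invented numbers (the `rates` Series is hypothetical) shows the difference:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical literacy rates with one missing entry; values are invented.
rates = pd.Series([66.92, np.nan, 71.21, 57.98, 65.0])

# Default behaviour: NaN is ignored, so the range uses 71.21 and 57.98.
value_range = rates.max() - rates.min()
print(round(value_range, 2))

# With skipna=False the missing entry propagates and the result is NaN.
strict_max = rates.max(skipna=False)
print(math.isnan(strict_max))
```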
