# District Literacy Analysis

# Introduction

In this notebook, I analyze literacy rate data for districts in a large nation. By exploring how literacy rates are distributed and identifying key patterns, the analysis aims to give a clearer picture of educational disparities and trends across regions.


## Overview

I will apply several statistical techniques to summarize and interpret the district-level literacy rates. The analysis proceeds through the following steps:

- Compute descriptive statistics to summarize district literacy rates.
- Use the normal distribution to model the data and identify patterns.
- Calculate z-scores to detect potential outliers.
- Simulate random sampling and estimate population means.
- Construct confidence intervals to assess the reliability of estimates.
- Conduct a two-sample hypothesis test to compare different district literacy rates.

By performing these analyses, I aim to gain a deeper understanding of literacy trends across districts, characterize the distribution of the data, and gauge how reliable the resulting estimates are for further research.


## Dataset Structure

### District Demographics Dataset
This dataset contains demographic and administrative information about districts across the states of the nation. Each record represents one district and includes details on its population, administrative divisions, and literacy rate. The key fields are:

- **DISTNAME**: The name of the district.
- **STATNAME**: The name of the state to which the district belongs.
- **BLOCKS**: The number of administrative blocks within the district.
- **VILLAGES**: The total number of villages in the district.
- **CLUSTERS**: The number of clusters present in the district.
- **TOTPOPULAT**: The total population of the district.
- **OVERALL_LI**: The overall literacy rate of the district (expressed as a percentage).


## Importing Required Libraries
Before beginning the analysis, I import the libraries used throughout the notebook.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

# Exploring Data Insights: Computing Descriptive Statistics with Python

## Introduction

In this section, I compute descriptive statistics to explore and summarize the literacy rate of each district in the education dataset. Descriptive statistics describe the central tendency, variability, and overall distribution of the data.

This step follows data cleaning and is a core part of exploratory data analysis (EDA): it lets me assess the dataset before moving on to more complex analyses. Summarizing the data numerically also helps me spot potential outliers and inconsistencies.

I will load the dataset and display a sample of the data.

In [3]:
education_districtwise = pd.read_csv(r'C:\Users\saswa\Documents\GitHub\District-Literacy-Analysis\Data\education_districtwise.csv')

### Exploring the Data

To get a quick overview of the dataset, I use the `head()` function. This allows me to see the first few rows and understand the structure of the data.

In [4]:
education_districtwise.head(10)

| Unnamed: 0 | DISTNAME | STATNAME | BLOCKS | VILLAGES | CLUSTERS | TOTPOPULAT | OVERALL_LI |
|---|---|---|---|---|---|---|---|
| 0 | DISTRICT32 | STATE1 | 13 | 391 | 104 | 875564.0 | 66.92 |
| 1 | DISTRICT649 | STATE1 | 18 | 678 | 144 | 1015503.0 | 66.93 |
| 2 | DISTRICT229 | STATE1 | 8 | 94 | 65 | 1269751.0 | 71.21 |
| 3 | DISTRICT259 | STATE1 | 13 | 523 | 104 | 735753.0 | 57.98 |
| 4 | DISTRICT486 | STATE1 | 8 | 359 | 64 | 570060.0 | 65.0 |
| 5 | DISTRICT323 | STATE1 | 12 | 523 | 96 | 1070144.0 | 64.32 |
| 6 | DISTRICT114 | STATE1 | 6 | 110 | 49 | 147104.0 | 80.48 |
| 7 | DISTRICT438 | STATE1 | 7 | 134 | 54 | 143388.0 | 74.49 |
| 8 | DISTRICT610 | STATE1 | 10 | 388 | 80 | 409576.0 | 65.97 |
| 9 | DISTRICT476 | STATE1 | 11 | 361 | 86 | 555357.0 | 69.9 |


Each row in this dataset represents a district, not a state or village. Some key columns I note:

- `VILLAGES`: Number of villages in each district

- `TOTPOPULAT`: Population of each district

- `OVERALL_LI`: Literacy rate for each district

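Keeping this granularity in mind helps me interpret the results correctly: every statistic below is computed over districts, so each district counts equally regardless of its population.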
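
The `Unnamed: 0` column in the preview is the label pandas gives to a column whose header is empty, which usually happens when a DataFrame's row index was written to the CSV. The following is a minimal sketch of one way to handle it, under the assumption (not confirmed by the source) that the file stores the index as its first, unnamed column. It reads an in-memory stand-in built from the first three preview rows rather than the real file, and the names `csv_text` and `sample` are illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file: the first three rows of the preview.
# Assumption: the header of the first column is empty, which is what
# pandas would otherwise label "Unnamed: 0".
csv_text = """,DISTNAME,STATNAME,BLOCKS,VILLAGES,CLUSTERS,TOTPOPULAT,OVERALL_LI
0,DISTRICT32,STATE1,13,391,104,875564.0,66.92
1,DISTRICT649,STATE1,18,678,144,1015503.0,66.93
2,DISTRICT229,STATE1,8,94,65,1269751.0,71.21
"""

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a regular data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# List the remaining columns and check that each row is a distinct district.
print(sample.columns.tolist())
print(sample["DISTNAME"].is_unique)
```

The `is_unique` check is a quick way to confirm the one-row-per-district assumption before computing any statistics.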

#### Computing Descriptive Statistics

Once I have an idea of the dataset, I use the pandas `describe()` method to compute key statistics. It is convenient because it returns several descriptive statistics in a single call.

In [None]:
education_districtwise['OVERALL_LI'].describe()

Here’s what each metric means:

- `count`: Total number of districts with available literacy rate data

- `mean`: Average literacy rate across all districts

- `std`: Standard deviation, showing how much the data varies

- `min` and `max`: Lowest and highest literacy rates

- `25%`, `50%`, `75%`: Quartiles of the data; the `50%` value is the median

From the output, I learn that the average literacy rate is about 73%, which gives me a benchmark for comparing individual districts. Note that this is an unweighted mean over districts, not a population-weighted national literacy rate.
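
To make the structure of the `describe()` output concrete, here is a small sketch that runs it on just the ten `OVERALL_LI` values visible in the `head(10)` preview. Because it uses only those ten rows, its numbers differ from the full-dataset figures quoted above; the variable names are illustrative.

```python
import pandas as pd

# OVERALL_LI values of the ten districts shown in the head(10) preview.
sample_li = pd.Series(
    [66.92, 66.93, 71.21, 57.98, 65.0, 64.32, 80.48, 74.49, 65.97, 69.9],
    name="OVERALL_LI",
)

# describe() returns a Series indexed by statistic name.
summary = sample_li.describe()
print(summary)

# Individual statistics can be looked up by their label.
sample_mean = summary["mean"]
sample_median = summary["50%"]  # the 50th percentile is the median
print(f"mean = {sample_mean:.2f}, median = {sample_median:.3f}")
```

Looking statistics up by label this way avoids copying numbers from the printed output by hand.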

#### Analyzing Categorical Data

I also explore categorical data using `describe()`. For example, I check how many unique states are represented and which state appears the most:

In [5]:
education_districtwise['STATNAME'].describe()

count         680
unique         36
top       STATE21
freq           75
Name: STATNAME, dtype: object

The dataset contains 680 district records spread over 36 unique states. The most common state (`STATE21`) appears 75 times, meaning it contains 75 districts, more than any other state. District counts alone do not measure educational need, but they show how unevenly states are represented, which matters when comparing state-level summaries.
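
`describe()` reports only the single most frequent state. A possible complement is `value_counts()`, which lists the number of districts for every state. The sketch below uses a small set of hypothetical state labels (`STATE_A`, `STATE_B`, `STATE_C`), not values from the dataset, purely to show the shape of the two outputs.

```python
import pandas as pd

# Hypothetical state labels, one entry per district (not taken from the dataset).
states = pd.Series(
    ["STATE_A", "STATE_B", "STATE_A", "STATE_C", "STATE_A", "STATE_B"],
    name="STATNAME",
)

# For text data, describe() gives count, unique, top, and freq.
state_summary = states.describe()
print(state_summary)

# value_counts() lists every state with its number of districts,
# sorted from most to least frequent.
districts_per_state = states.value_counts()
print(districts_per_state)
```

On the real data, the same call on `education_districtwise['STATNAME']` would show how many districts each of the 36 states contributes.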

#### Calculating Range in Literacy Rate

To measure the spread of literacy rates, I calculate the range using the `max()` and `min()` methods.

In [None]:
range_overall_li = education_districtwise['OVERALL_LI'].max() - education_districtwise['OVERALL_LI'].min()
range_overall_li

The result shows a range of about 61.5 percentage points between the least and most literate districts. A gap this wide points to substantial disparities in education levels across regions.
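
The range says how far apart the extremes are, but not which districts they belong to. As a rough sketch, `idxmin()` and `idxmax()` can recover that. The example uses only the ten districts from the `head(10)` preview, so its range is much smaller than the full-dataset value; the names `sample`, `lowest`, and `highest` are illustrative.

```python
import pandas as pd

# DISTNAME and OVERALL_LI for the ten districts shown in the head(10) preview.
sample = pd.DataFrame({
    "DISTNAME": [
        "DISTRICT32", "DISTRICT649", "DISTRICT229", "DISTRICT259", "DISTRICT486",
        "DISTRICT323", "DISTRICT114", "DISTRICT438", "DISTRICT610", "DISTRICT476",
    ],
    "OVERALL_LI": [66.92, 66.93, 71.21, 57.98, 65.0, 64.32, 80.48, 74.49, 65.97, 69.9],
})

li = sample["OVERALL_LI"]

# Range = highest literacy rate minus lowest literacy rate.
sample_range = li.max() - li.min()

# idxmin() / idxmax() return the row labels of the extreme values,
# which can then be used to look up the district names.
lowest = sample.loc[li.idxmin(), "DISTNAME"]
highest = sample.loc[li.idxmax(), "DISTNAME"]

print(f"range: {sample_range:.2f} percentage points ({lowest} to {highest})")
```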

### Key Takeaways

- The dataset contains literacy rate data for different districts, helping in identifying educational disparities.

- The average literacy rate is about 73%, providing a benchmark for comparison.

- Some districts have substantially lower literacy rates than others: the gap between the lowest and highest district is about 61.5 percentage points.

- The dataset covers 36 states, and the most represented one (`STATE21`) accounts for 75 districts, so states are unevenly represented.