# Codebook  
**Authors:** Lauren Baker  
Documents the existing DaanMatch data files, recording each file's location, owner, "version", source, and similar details.

In [31]:
import boto3
import numpy as np 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import statistics

In [44]:
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('my-bucket')

# Districts--.csv
## TOC:
* [About this dataset](#1)
* [What's in this dataset](#2)
* [Codebook](#3)
    * [Missing values](#3.1)
    * [Summary statistics](#3.2)
* [Columns](#4)
    * [Name](#4.1)
    * [Value](#4.2)

**About this dataset**  <a class="anchor" id="1"></a>  
Data provided by: Unknown.    
Source: https://daanmatchdatafiles.s3-us-west-1.amazonaws.com/DaanMatch_DataFiles/Districts--.csv  
Type: csv  
Last Modified: May 29, 2021, 19:54:25 (UTC-07:00)  
Size: 11.6 KB
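
The Type, Last Modified, and Size fields above can also be read from S3 rather than copied by hand. The following is a minimal sketch, not necessarily how the values above were obtained. It assumes the `client` created earlier has credentials that can read the bucket, the helper name `describe_s3_object` is hypothetical, and the 1024-bytes-per-KB conversion is an assumption about how the size above was rounded.

```python
def describe_s3_object(client, bucket, key):
    """Return basic metadata for one S3 object (hypothetical helper)."""
    # head_object returns metadata only; the file body is not downloaded.
    response = client.head_object(Bucket=bucket, Key=key)
    size_bytes = response["ContentLength"]
    return {
        "key": key,
        "type": key.rsplit(".", 1)[-1],            # file extension, e.g. "csv"
        "size_bytes": size_bytes,
        "size_kb": round(size_bytes / 1024, 1),    # assumption: 1 KB = 1024 bytes
        "last_modified": response["LastModified"],
    }

# Usage with the boto3 client from the setup cell (needs AWS credentials):
# describe_s3_object(client, "daanmatchdatafiles",
#                    "DaanMatch_DataFiles/Districts--.csv")
```

Passing the client in as an argument keeps the helper independent of how the session is configured.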

In [33]:
path = "s3://daanmatchdatafiles/DaanMatch_DataFiles/Districts--.csv"
districts = pd.read_csv(path)
districts

Unnamed: 0,KeyColumn,Name,Value
0,0,.,0
1,0,.,0
2,1,Kupwara,1
3,2,Badgam,2
4,3,Leh(Ladakh),3
...,...,...,...
669,728,Devbhoomi Dwarka,728
670,729,Gir Somnath,729
671,730,Mahisagar,730
672,731,Chota Udaipur,731


**What's in this dataset?** <a class="anchor" id="2"></a>

In [34]:
print("Shape:", districts.shape)
print("Rows:", districts.shape[0])
print("Columns:", districts.shape[1])
print("Each row is a district in India.")

Shape: (674, 3)
Rows: 674
Columns: 3
Each row is a district in India.


**Codebook** <a class="anchor" id="3"></a>

In [35]:
districts_columns = list(districts.columns)
districts_description = ["Same as the Value column.",
                            "Name of the district in India. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of those districts.",
                            "This column has no intrinsic meaning; it serves only to number the districts."]
districts_dtypes = list(districts.dtypes)

data = {"Column Name": districts_columns, "Description": districts_description, "Type": districts_dtypes}
districts_codebook = pd.DataFrame(data)
districts_codebook.style.set_properties(subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column Name,Description,Type
0,KeyColumn,Same as the Value column.,int64
1,Name,"Name of the district in India. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of those districts.",object
2,Value,This column has no intrinsic meaning; it serves only to number the districts.,int64


**Missing values** <a class="anchor" id="3.1"></a>

In [36]:
districts.isnull().sum()

KeyColumn    0
Name         0
Value        0
dtype: int64

**Summary statistics** <a class="anchor" id="3.2"></a>

In [37]:
districts.describe()

Unnamed: 0,KeyColumn,Value
count,674.0,674.0
mean,338.348665,338.348665
std,199.766956,199.766956
min,0.0,0.0
25%,167.25,167.25
50%,335.5,335.5
75%,503.75,503.75
max,732.0,732.0
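
The summary statistics for `KeyColumn` and `Value` are identical, which is consistent with the codebook's note that the two columns are the same. Matching statistics do not prove that the columns match row by row, so a direct comparison is a useful check. The sketch below uses a small toy frame rather than the real file, and the helper names `columns_match` and `mismatched_rows` are hypothetical.

```python
import pandas as pd

def columns_match(frame, left, right):
    """Return True when two columns hold the same value in every row."""
    return bool((frame[left] == frame[right]).all())

def mismatched_rows(frame, left, right):
    """Return the rows where the two columns disagree."""
    return frame.loc[frame[left] != frame[right]]

# Toy data shaped like the first rows of the districts table.
toy = pd.DataFrame({
    "KeyColumn": [0, 0, 1, 2],
    "Name": [".", ".", "Kupwara", "Badgam"],
    "Value": [0, 0, 1, 2],
})

print(columns_match(toy, "KeyColumn", "Value"))   # True
print(len(mismatched_rows(toy, "KeyColumn", "Value")))  # 0
```

On the real data the same calls would be `columns_match(districts, "KeyColumn", "Value")` and `mismatched_rows(districts, "KeyColumn", "Value")`.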


## Columns
<a class="anchor" id="4"></a>

### Name
<a class="anchor" id="4.1"></a>
Name of the district in India. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of those districts.

In [38]:
column = districts["Name"]
column

0                     .
1                     .
2               Kupwara
3                Badgam
4           Leh(Ladakh)
             ...       
669    Devbhoomi Dwarka
670         Gir Somnath
671           Mahisagar
672       Chota Udaipur
673             Palghar
Name: Name, Length: 674, dtype: object

In [39]:
print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = {key: value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if len(duplicates) > 0:
    print("No. of duplicates:", len(duplicates))

No. of unique values: 662
Duplicates: {'.': 2, 'Hamirpur': 2, 'Bilaspur': 2, 'North': 2, 'East': 2, 'West': 2, 'South': 2, 'Pratapgarh': 2, 'Balrampur': 2, 'Aurangabad': 2, 'Raigarh': 2, 'Bijapur': 2}
No. of duplicates: 12
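
The duplicate check above is repeated for each column, so it can be wrapped in a small reusable function. This is a sketch with hypothetical helper names (`find_duplicates`, `count_extra_rows`), shown on a toy list rather than the real column. It accepts any iterable, including a pandas Series.

```python
from collections import Counter

def find_duplicates(values):
    """Map each value that occurs more than once to its number of occurrences."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}

def count_extra_rows(duplicates):
    """Count the rows beyond the first occurrence of each duplicated value."""
    return sum(count - 1 for count in duplicates.values())

# Toy sample: "." stands in for the placeholder rows seen in the Name column.
sample = [".", ".", "Kupwara", "Hamirpur", "Hamirpur", "Leh(Ladakh)"]
duplicates = find_duplicates(sample)
print(duplicates)                    # {'.': 2, 'Hamirpur': 2}
print(count_extra_rows(duplicates))  # 2
```

Applied to the output above, the 12 duplicated names each occur twice, giving 12 extra rows, which accounts for the gap between 674 rows and 662 unique values.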


In [40]:
districts.loc[districts['Name'].isin(duplicates.keys())].sort_values("Name")

Unnamed: 0,KeyColumn,Name,Value
0,0,.,0
1,0,.,0
516,515,Aurangabad,515
236,235,Aurangabad,235
183,182,Balrampur,182
657,716,Balrampur,716
418,417,Bijapur,417
558,557,Bijapur,557
407,406,Bilaspur,406
31,30,Bilaspur,30


### Value
<a class="anchor" id="4.2"></a>
This column has no intrinsic meaning; it serves only to number the districts.

In [41]:
column = districts["Value"]
column

0        0
1        0
2        1
3        2
4        3
      ... 
669    728
670    729
671    730
672    731
673    732
Name: Value, Length: 674, dtype: int64

In [42]:
print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = {key: value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if len(duplicates) > 0:
    print("No. of duplicates:", len(duplicates))

No. of unique values: 672
Duplicates: {0: 2, 90: 2}
No. of duplicates: 2


In [43]:
districts.loc[districts['Value'].isin(duplicates.keys())]

Unnamed: 0,KeyColumn,Name,Value
0,0,.,0
1,0,.,0
91,90,North West,90
92,90,North,90
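
The outputs above also imply that the numbering is not contiguous: the values run from 0 to 732, a range of 733 possible numbers, yet only 672 distinct values are present, so 61 numbers in that range are unused. The sketch below shows one way to list such gaps. The helper name `missing_codes` is hypothetical and the input is a toy list, not the real column.

```python
def missing_codes(values):
    """Return the integers between min and max that never appear in values."""
    present = set(values)
    if not present:
        return []
    return [code for code in range(min(present), max(present) + 1)
            if code not in present]

# Toy sample with repeated values and gaps at 3, 6 and 7.
sample_values = [0, 0, 1, 2, 4, 5, 5, 8]
print(missing_codes(sample_values))  # [3, 6, 7]
```

On the real data this would be called as `missing_codes(districts["Value"])`.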
