### **This notebook preprocesses the *HealthCare.csv* file**
The two main tasks are removing special characters from the state and district names, and replacing the value 'NA' in the numerical attributes with 0.
Leading and trailing whitespace is also stripped.

We start by importing the two modules we need, `csv` and `re`; the comment next to each import describes its purpose.

In [None]:
import csv  #to use csv.writer to write our preprocessed data into a .csv file
import re   # regular expressions package to remove special characters by using re.sub() function

Function to check whether a particular string can be converted to an integer.
The implementation relies on exception handling: if `int()` raises a `ValueError` while converting the string, the function returns `False`; otherwise it returns `True`.

In [3]:
def isInteger(text):
    # parameter is named `text` rather than `str` so the built-in str type is not shadowed
    try:
        int(text)
    except ValueError:
        return False
    return True

Function to get the integer value of a string. If the string cannot be converted to an integer (in our case, the attribute value **NA**), it returns 0.

In [4]:
def intValueOf(text):
    # parameter is named `text` rather than `str` so the built-in str type is not shadowed
    if isInteger(text):
        return int(text)
    return 0

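As a quick check of the conversion rule described above, here is a minimal sketch. The name `int_value_of` and the sample strings are hypothetical, chosen only for illustration; the function is meant as a compact equivalent of the two helpers, not a replacement for them.

```python
def int_value_of(text):
    """Hypothetical compact equivalent of intValueOf(): int(text), or 0 on failure."""
    try:
        return int(text)
    except ValueError:
        return 0


# Illustrative inputs, not taken from HealthCare.csv.
samples = ["12", " 7 ", "NA", "", "3.5", "-4"]
converted = [int_value_of(s) for s in samples]
print(converted)  # [12, 7, 0, 0, 0, -4]
```

Note that `int()` already tolerates surrounding whitespace, and that anything that is not a plain integer literal (including the empty string and `"3.5"`) is mapped to 0, not only `NA`.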
Open *HealthCare.csv* and store each line of the file, as a string, in the list `healthCareList`.

In [5]:
with open('../data/HealthCare.csv', 'r') as healthCareCSV:
    healthCareList = list(healthCareCSV)

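Reading the file as raw lines means each row later has to be split by hand. If a field were quoted and contained a comma, a plain split on ',' would break it into too many pieces. The sketch below shows how `csv.reader` handles that case; the sample text is made up for illustration and is only an assumption about what such a row might look like.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file; not taken from HealthCare.csv.
sample = io.StringIO(
    'State,District,Sub Centres\n'
    'State A,"District X, North",12\n'
    'State B,District Y,NA\n'
)

# csv.reader understands quoting, so the quoted comma stays inside one field.
rows = list(csv.reader(sample))

# A plain split on ',' cuts the quoted field in two.
naive = 'State A,"District X, North",12'.split(",")

print(rows[1])     # ['State A', 'District X, North', '12']
print(len(naive))  # 4
```

If the input file never quotes its fields, both approaches give the same result.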
Initialize `reducedCSVList`, the new list of rows, with the heading names, and the boolean flag `headingRow`, which is used to skip the first row of the input.

In [None]:
headingRow = True
reducedCSVList = []
reducedCSVList.append(['State', 'District', 'Number of Sub Centres', 'Number of Primary Health Centres', 'Number of Community Health Centres', 'Sub Divisional Hospitals', 'District Hospitals'])

Iterate through the lines in `healthCareList`, splitting each one on '**,**' to get the list of its elements, `rowList`.
Leading and trailing whitespace is removed with `strip()`, and each run of special characters in the state and district names is replaced by a single space with `re.sub()`.
The numerical attributes, which are read as strings, are converted to integers with the helper function `intValueOf()`, which also takes care of converting *NA* to *0*.
The new values are then appended to the new list, `reducedCSVList`.

In [9]:
for row in healthCareList:
    #Skipping first row
    if headingRow:
        headingRow = False
        continue
        
    rowList = row.split(",") #List of elements of a row
    
    #Integer value of the numerical attributes. Converting NA to 0 as well.
    noOfSubCentres = intValueOf(rowList[2].strip())
    noOfPrimaryCentres = intValueOf(rowList[3].strip())
    noOfCommunityCentres = intValueOf(rowList[4].strip())
    noOfSubDivisional = intValueOf(rowList[5].strip())
    noOfDistrict = intValueOf(rowList[6].strip())
    
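    # removing special characters: each run of non-word characters becomes one space.
    # The raw string keeps \W from being read as a string escape, and strip() drops
    # the space left behind when a name starts or ends with a special character.
    stateClean = re.sub(r'\W+', ' ', rowList[0]).strip()
    districtClean = re.sub(r'\W+', ' ', rowList[1]).strip()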

    #appending the new values to a list
    reducedCSVList.append([stateClean, districtClean, noOfSubCentres, noOfPrimaryCentres, noOfCommunityCentres, noOfSubDivisional, noOfDistrict])

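To make the effect of the `\W+` pattern concrete, here is a small sketch. The helper name `clean_name` is hypothetical and the input strings are illustrative, not taken from the data file.

```python
import re


def clean_name(raw):
    """Hypothetical helper: replace each run of non-word characters with one space."""
    return re.sub(r'\W+', ' ', raw).strip()


print(clean_name(' Andaman & Nicobar Islands* '))  # Andaman Nicobar Islands
print(clean_name('Y.S.R.'))                        # Y S R
print(clean_name('East_Godavari'))                 # East_Godavari
```

`\W` matches anything other than letters, digits, and the underscore, so underscores and digits are kept, while punctuation and whitespace runs collapse to a single space.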

Write the new list to a new *.csv* file, *healthCleanReduced.csv*.

In [8]:
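# newline='' is what the csv module documentation recommends for files passed to csv.writer;
# without it, extra blank lines can appear between rows on some platforms
with open('../data/healthCleanReduced.csv', 'w', newline='') as healthCleanReduced:
    writer = csv.writer(healthCleanReduced)
    writer.writerows(reducedCSVList)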
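
One way to sanity-check the written output is to read it back. The sketch below does a round trip through an in-memory buffer instead of a file on disk; the rows are hypothetical placeholders, not real records.

```python
import csv
import io

# Hypothetical rows in the same shape as reducedCSVList (header plus one record).
rows = [
    ['State', 'District', 'Number of Sub Centres'],
    ['State A', 'District X, North', 12],
]

# In-memory stand-in for the output file; newline='' mirrors the csv module's advice.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)

# Rewind and read the rows back.
buffer.seek(0)
read_back = list(csv.reader(buffer))

print(read_back[1])  # ['State A', 'District X, North', '12']
```

The integer comes back as the string `'12'`: a CSV file stores only text, so any later step that needs numbers has to convert them again.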